In an era when neural networks dominated the field, Vladimir Vapnik and his colleagues introduced Support Vector Machines (SVM) in the early 1990s, one of the most versatile algorithms in machine learning. What makes SVM a good model for both classification and regression tasks is its ability to handle both linear and non-linear data. It is a purely mathematical model with a strong theoretical foundation, making it a good candidate for further research and development. It is also often described as a black-box algorithm: it produces excellent results, but those results are hard to interpret.
Before diving deep into SVM, it is important to understand the basic terminology associated with it. Consider the following example, which walks through the key terms of SVM.
Hyperplane: A hyperplane is the decision boundary that separates the classes. In 1D it is a point, in 2D a line, in 3D a plane, and in higher dimensions it is simply called a hyperplane.
Support Vectors: Support vectors are the closest data points to the decision boundary. These support vectors have the most impact on determining the position of the decision boundary.
Margin: The margin is the distance between the hyperplane and the support vectors. The main objective of SVM is to maximize this margin, which gives the maximum separation between the classes.
In this case, there are two different classes (circles and triangles). The bold line that divides these two classes is the hyperplane. It is also called the maximum margin hyperplane because it separates the data points of the two classes with the largest possible margin. The upper dotted line is the positive hyperplane: points on or above it belong to the positive class. The lower dotted line is the negative hyperplane: points on or below it belong to the negative class. The data points closest to the hyperplane are the support vectors, and the gap between the positive and negative hyperplanes is the margin.
SVM is one of the most commonly used supervised machine learning algorithms, and it is particularly effective for classification tasks. At its core, SVM is a mathematical model that operates only on numeric/quantitative data. It can work on categorical data after proper preprocessing such as one-hot encoding, though the results may be less reliable than on purely numeric data. While SVM can operate as a linear classifier in its basic form, it is also equipped to handle non-linear relationships using the kernel trick. This lets SVM function effectively in high-dimensional spaces, making it a versatile tool for a wide range of machine-learning problems.
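A minimal sketch of that preprocessing step, assuming a toy pandas DataFrame whose column names and values are purely illustrative: the categorical column is one-hot encoded inside a pipeline before the SVM ever sees the data.

# Sketch: one-hot encode a categorical column before feeding data to an SVM.
# The column names and toy values below are purely illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

X = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],   # categorical feature
    "size":  [1.0,   2.5,     0.7,   3.1],      # numeric feature
})
y = [0, 1, 0, 1]

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("scale",  StandardScaler(), ["size"]),
])

model = Pipeline([("prep", preprocess), ("svm", SVC(kernel="linear"))])
model.fit(X, y)
print(model.predict(X))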
The main idea behind the SVM (Support Vector Machine) is to maximize the margin, that is, to maximize the distance between the closest data points of each class. These closest points are known as support vectors. The SVM algorithm identifies these points and uses them to construct two lines (or hyperplanes in higher dimensions): one for each class. These are called the positive and negative hyperplanes.
A third line (or hyperplane), which lies exactly midway between the positive and negative hyperplanes, is then drawn. This line is known as the maximum margin hyperplane because it maximizes the distance, or margin, between the classes. The maximum margin hyperplane acts as the decision boundary that separates the classes with as wide a gap as possible, thereby aiming to minimize classification errors, especially on new, unseen data.
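A minimal sketch of these ideas, using scikit-learn's SVC on a made-up, linearly separable 2D toy set: after fitting, the support vectors can be read off directly, and for a linear kernel the margin width is 2 / ||w||.

# Sketch: fit a linear SVM on a small, linearly separable toy set and inspect
# the support vectors and the margin width (2 / ||w||).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [4, 5], [5, 6], [6, 5]])  # toy 2D points
y = np.array([0, 0, 0, 1, 1, 1])                                # two classes

clf = SVC(kernel="linear", C=1000)   # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                     # normal vector of the separating hyperplane
print("support vectors:\n", clf.support_vectors_)
print("margin width:", 2 / np.linalg.norm(w))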
SVMs are linear separators: the main objective of SVM is to find a hyperplane in N-dimensional space (where N is the number of features) that separates the data points according to their classes, while maximizing the margin (the distance to the closest data points of each class). In layman's terms, SVM looks for a straight boundary that divides the two classes with the greatest possible distance to the closest point from each class.
This may lead to another question: does SVM work only on linear data, or can it also handle non-linear data?
The answer is that SVMs are not limited to linear data; they can be used for both linear and non-linear data. How SVMs handle non-linear data is discussed below.
In the real world, data is not always linear; there is a high chance it is non-linear, which raises the question: can SVM still be used in these cases? The answer is yes: SVMs can handle non-linear data using the kernel trick.
SVM handles this by mapping the input data into a higher-dimensional feature space where it can be more easily separated. The kernel trick is a method for achieving this mapping without actually computing the coordinates of the data in that higher dimension. Kernels are mathematical functions that behave as if the data had been transformed into a higher-dimensional space where it may become linearly separable. In simple terms, if the data is non-linear, it can be passed through a kernel function, which lets SVM find a decision boundary in a higher-dimensional space rather than in the original feature space. The advantage of the kernel function is that the explicit transformation is never needed; it is handled implicitly, which keeps SVM computationally efficient even on large datasets.
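A small sketch of this point, with an arbitrary dataset and an arbitrary fixed gamma value: the model only ever consumes kernel values, so passing the RBF formula to scikit-learn's SVC as a plain Python function gives the same predictions as the built-in "rbf" kernel.

# Sketch: the SVM only consumes kernel values K(a, b); it never needs explicit
# coordinates in the higher-dimensional space.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
gamma = 0.5

def my_rbf(A, B):
    # K(a, b) = exp(-gamma * ||a - b||^2), computed pairwise
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

built_in = SVC(kernel="rbf", gamma=gamma).fit(X, y)
custom = SVC(kernel=my_rbf).fit(X, y)
print((built_in.predict(X) == custom.predict(X)).all())  # True: same model either way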
In the given example, the red squares and green circles represent two different classes. It is evident that the distribution of data points is non-linear, making it challenging for a linear model to classify the data effectively. In such scenarios, SVMs face difficulty in drawing a linear decision boundary directly in the original feature space.
This is where the kernel function plays a critical role. By applying a kernel function, SVM can transform the original data into a higher-dimensional space. In this new space, the previously complex and intertwined data distribution may become simpler, often allowing for a linear separation between the classes.
Essentially, the kernel function enables the SVM to find a hyperplane in the transformed feature space that effectively separates the data into their respective classes. This process demonstrates the power of SVMs in handling non-linear data by leveraging kernel tricks to project data into higher dimensions where a linear separator is feasible.
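As a rough stand-in for the kind of data described above, scikit-learn's make_circles produces one class nested inside the other; the dataset parameters below are arbitrary. A linear kernel stays near chance level, while an RBF kernel separates the classes almost perfectly.

# Sketch: linear vs. RBF kernel on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear").fit(X_tr, y_tr)
rbf = SVC(kernel="rbf").fit(X_tr, y_tr)

print("linear kernel accuracy:", linear.score(X_te, y_te))  # roughly chance level
print("rbf kernel accuracy:   ", rbf.score(X_te, y_te))     # close to 1.0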
In the example provided, the left plot shows the XOR function, which is linearly inseparable in two dimensions. However, by applying a kernel function and transforming the data into a higher-dimensional space (3D), SVMs can effectively draw a linear decision boundary. This transformation allows SVMs to handle the XOR problem by making the data linearly separable in the new space, enabling accurate classification despite the original complexity.
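A small sketch of that idea: here the third dimension is built by hand as the product feature x1*x2, standing in for what a kernel would do implicitly, and a linear SVM that fails on the flat XOR points separates them once they are lifted.

# Sketch: XOR is not linearly separable in 2D, but adding an x1*x2 feature
# (a hand-built stand-in for a kernel's implicit mapping) makes it separable in 3D.
import numpy as np
from sklearn.svm import SVC

X2d = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                           # XOR labels

X3d = np.column_stack([X2d, X2d[:, 0] * X2d[:, 1]])  # lift to 3D with the x1*x2 feature

flat = SVC(kernel="linear", C=100).fit(X2d, y)
lifted = SVC(kernel="linear", C=100).fit(X3d, y)

print("2D training accuracy:", flat.score(X2d, y))    # stays below 1.0: not separable
print("3D training accuracy:", lifted.score(X3d, y))  # 1.0: separable after the lift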
Without delving too deeply into the mathematical details, let's discuss the significance of the dot product in SVM. In SVM, the primary aim is to maximize the margin: the distance between the decision boundary and the nearest data points from each class. Two equivalent formulations, the primal and the dual, are used to achieve this objective, and both yield the same result. The dual form can be expressed mathematically as follows.
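In standard notation (hard-margin case; the soft-margin version additionally bounds each alpha_i above by C), the dual objective is:

maximize over alpha:  L(alpha) = sum_i alpha_i - (1/2) * sum_i sum_j alpha_i * alpha_j * y_i * y_j * (x_i . x_j)
subject to:  alpha_i >= 0 for all i, and sum_i alpha_i * y_i = 0

where alpha_i are the Lagrange multipliers, y_i are the class labels (+1 or -1), and x_i . x_j is the dot product between training points.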
This equation states the objective of the SVM in its dual form. The last term on the right, the dot product of x_i and x_j, is especially important because it is what makes the kernel trick possible, and hence what lets SVM work even on non-linear data. Kernels implicitly transform the data points into a higher-dimensional space (depending on the kernel function used) and return the result in the form of a dot product in that space. The output of the kernel function therefore simply replaces the dot-product term in the SVM equation. This is why the dot product is so central to the use of kernels in SVM.
There are many types of kernel functions, but the most commonly used ones are:
Linear Kernel
Polynomial Kernel
Radial Basis Function Kernel (RBF)
Sigmoid Kernel
The linear kernel is one of the simplest kernel functions. It is also a special case of the polynomial kernel with the degree parameter d set to 1. Unlike other kernels that project data into higher-dimensional spaces, the linear kernel operates in the original feature space, computing the dot product of two input feature vectors. It is mathematically represented by:
K(x, y) = x.y + r
where K is the kernel function
x and y are feature vectors
r is a constant
The linear kernel is particularly useful in situations where the data is already linearly separable, meaning there is no need for a transformation to higher-dimensional space to achieve classification. This kernel effectively preserves the original geometry of the data, making it a straightforward but powerful tool for linear classification problems.
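A quick numerical check, using scikit-learn's linear_kernel, which corresponds to r = 0 in the formula above: the kernel value is just the ordinary dot product.

# Sketch: the linear kernel is the plain dot product (r = 0 in scikit-learn).
import numpy as np
from sklearn.metrics.pairwise import linear_kernel

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[5.0, 6.0]])

print(linear_kernel(X, Y))   # [[17.], [39.]]
print(X @ Y.T)               # same values: plain dot products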
Polynomial Kernel
The polynomial kernel is one of the most widely used kernel functions. It projects data into higher-dimensional spaces, making it easier for the SVM to separate the data linearly. It is mathematically represented by:
K(x, y) = (x.y + r)^d
where K is the kernel function
x and y are feature vectors
r is a constant
d is a degree parameter.
By taking the dot product of the feature vectors and adding a constant term r before raising the result to the power of d, the polynomial kernel effectively adds complexity to the model. It's important to choose the degree d judiciously, as a high degree can lead to overfitting, especially if the data set is not large enough to support such a high-dimensional representation.
The polynomial kernel effectively enables algorithms to learn non-linear decision boundaries, thus extending their applicability beyond linearly separable datasets. It is particularly useful in scenarios where the relationship between the input features and the target variable is expected to be non-linear.
Let's consider four 2D points: A (0,0), B (0,1), C (1,0), and D (1,1).
Let's assume a polynomial kernel with r = 1 and degree d = 2 is applied to this data.
So the equation of the polynomial kernel with r = 1 and d = 2 is
K(x, y) = (x.y + 1)^2 = (x.y)^2 + 2(x.y) + 1
With the XOR-style labeling (A and D in one class, B and C in the other), these points cannot be linearly separated by SVM in the input space. After applying the kernel function, the data is transformed into a higher-dimensional space where SVM can establish a linear decision boundary.
These transformed points are now represented in a 6-dimensional space. Each dimension is a part of the polynomial kernel's transformation, which effectively captures the interplay between the original dimensions, their squares, and cross-terms, scaled appropriately. This is how the polynomial kernel helps in creating nonlinear decision boundaries in the original space by operating in this higher-dimensional feature space.
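A quick numerical check of this claim for the four points above. The 6-dimensional feature map, written out explicitly, is (1, sqrt(2)*x1, sqrt(2)*x2, x1^2, x2^2, sqrt(2)*x1*x2), and its dot products reproduce the kernel values exactly.

# Sketch: the degree-2 polynomial kernel K(x, y) = (x . y + 1)^2 equals an
# ordinary dot product in the explicit 6-dimensional feature space.
import numpy as np

pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # A, B, C, D

def phi(p):
    x1, x2 = p
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

K_kernel = (pts @ pts.T + 1) ** 2                      # kernel trick: dot products only
K_explicit = np.array([[phi(a) @ phi(b) for b in pts] for a in pts])

print(np.allclose(K_kernel, K_explicit))               # True: the two agree exactly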
Mathematical Representation of RBF Kernel
K(a, b) = exp(-gamma * ||a - b||^2)
where K is the kernel function
a and b are data points
gamma is a scaling factor / similarity measure that controls how quickly the kernel value falls off with distance.
The RBF kernel is also one of the most widely used kernel functions; it is also called the Gaussian kernel. Like the polynomial kernel, it projects data into a higher-dimensional space, making it easier for the SVM to separate the data linearly.
The RBF kernel is especially useful for problems where the decision boundary is very irregular, as it can adapt to complex structures. However, it can also lead to overfitting if not properly regularized or if the gamma parameter is chosen too large.
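A small numerical illustration of how gamma controls this behaviour (the gamma values and points below are arbitrary): the kernel value shrinks toward 0 as two points move apart, and larger gamma makes the kernel more local, which is what can drive overfitting.

# Sketch: RBF value for two fixed points under different gamma values,
# checked against the formula exp(-gamma * ||a - b||^2).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])            # squared distance ||a - b||^2 = 2

for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(a, b, gamma=gamma)[0, 0], np.exp(-gamma * 2.0))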
The sigmoid kernel is another popular kernel function; it is also called the hyperbolic tangent (tanh) kernel. Like the other kernels, it projects data into a higher-dimensional space, making it easier for the SVM to separate the data linearly. As the name suggests, the function used in this kernel is the sigmoid, commonly written as K(x, y) = tanh(gamma * x.y + r).
This kernel is mainly used in binary classification tasks, since the tanh function squashes its output into the range (-1, 1), which maps naturally onto two classes. Its advantage is that it is computationally fast; its downside is that it is prone to overfitting or poor performance if its parameters are not chosen properly.
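A quick check of the kernel value itself, using scikit-learn's sigmoid_kernel; the gamma and coef0 values below are arbitrary illustrative choices.

# Sketch: the sigmoid (tanh) kernel squashes scaled dot products into (-1, 1).
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.array([[0.5, -1.0]])

gamma, coef0 = 0.5, 1.0
print(sigmoid_kernel(X, Y, gamma=gamma, coef0=coef0))
print(np.tanh(gamma * (X @ Y.T) + coef0))   # same values, computed by hand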
SVMs can be utilized for various purposes across multiple fields. Some common applications of SVMs include:
Face Detection
Bioinformatics
Protein fold and remote homology detection
Text and hypertext categorization (a brief sketch of this use case follows the list)
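A hedged sketch of the text-categorization use case: TF-IDF features plus a linear SVM. The tiny corpus, labels, and pipeline choices below are made up purely for illustration.

# Sketch: text classification with TF-IDF features and a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = ["free prize, claim now", "meeting moved to 3pm",
        "win money fast", "project status update"]
labels = ["spam", "ham", "spam", "ham"]

text_clf = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
text_clf.fit(docs, labels)
print(text_clf.predict(["claim your free money", "status of the meeting"]))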
SVMs are versatile and memory-efficient.
SVMs remain effective even in high-dimensional spaces.
SVMs are sensitive to noise and do not perform well when classes overlap heavily.
SVMs are hard to interpret.