Support Vector Machines (SVMs) are powerful supervised learning models used for both classification and regression problems. An SVM operates by finding the separating boundary, called a hyperplane, that has the maximum margin between the classes (or, for regression, between the corresponding regions). In two dimensions a hyperplane is a straight line, in three dimensions it is a flat plane, and in higher-dimensional spaces it is the analogous flat subspace with one dimension fewer than the space itself. The main purpose of the SVM is to find the hyperplane that maximizes the margin between data points of different classes (or corresponding regression regions), which improves the model's generalization performance on new (unseen) data.
At their core, SVMs are linear classifiers: the fundamental goal is to create a straight boundary that separates the data. When the dataset is linearly separable, the SVM finds the hyperplane that separates the classes with the largest possible margin. A wide margin is key to reducing the model's sensitivity to small changes in the dataset and also guards against overfitting. If a line or plane can cleanly separate the two classes, an SVM works quite well without any complex nonlinear transformation.
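To make this concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the tiny dataset is made up for illustration) that fits a linear SVM on two well-separated clusters and reads back the learned hyperplane and its support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters: class 0 near the origin, class 1 shifted away.
X = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("hyperplane weights w:", clf.coef_)      # boundary is w·x + b = 0
print("hyperplane bias b:", clf.intercept_)
print("support vectors:", clf.support_vectors_)
print("prediction for (2, 2):", clf.predict([[2.0, 2.0]]))
```

The support vectors are the points that sit on the edge of the margin; they alone determine where the hyperplane ends up.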
Real-world data are often messy rather than neat, so they are frequently not linearly separable; that is, no straight line (or flat hyperplane) can classify them correctly. SVMs get around this limitation through the kernel trick. The kernel trick lets an SVM behave as if the data had been mapped into a higher-dimensional space, where the data may become linearly separable. The kernel function returns the result the data would give in that higher-dimensional space without ever computing the transformed coordinates explicitly, which saves both computation time and memory. The kernel trick therefore allows SVMs to fit complicated patterns and boundaries without the computational cost becoming prohibitive.
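As an illustration (a sketch assuming scikit-learn; the concentric-circles dataset and the gamma value are arbitrary choices made here), compare a plain linear SVM with a kernelized one on data that no straight line can separate:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable in 2D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X_train, y_train)

# The linear model struggles; the kernelized model separates the rings.
print("linear accuracy:", linear_svm.score(X_test, y_test))
print("rbf accuracy:   ", rbf_svm.score(X_test, y_test))
```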
The dot product drives the inner workings of SVMs and appears in almost every computation once kernels are involved. In an SVM, the margin and the decision boundary are defined in terms of dot products between data points. The kernel trick works by substituting each of these dot products with a kernel function that corresponds to a dot product in a higher-dimensional space. This is what allows SVMs to learn nonlinear relationships without ever constructing the higher-dimensional representation of the data. Without the dot product formulation and the kernel trick built on top of it, SVMs would lose much of their flexibility and power.
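This substitution can be seen directly in code. scikit-learn's SVC accepts a callable kernel, so the sketch below (the helper names plain_dot and poly_kernel are illustrative, not part of any library) swaps the ordinary dot product for a polynomial kernel on the same kind of non-separable data:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def plain_dot(X, Y):
    # Ordinary dot products -> a purely linear SVM.
    return X @ Y.T

def poly_kernel(X, Y, r=1.0, d=2):
    # (x · x' + r)^d -> dot products taken in an implicit polynomial space.
    return (X @ Y.T + r) ** d

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

print("dot-product SVM accuracy:", SVC(kernel=plain_dot).fit(X, y).score(X, y))
print("poly-kernel SVM accuracy:", SVC(kernel=poly_kernel).fit(X, y).score(X, y))
```

Nothing else about the algorithm changes; only the function used to compare pairs of points is replaced.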
The polynomial kernel implicitly transforms the original feature space into a higher-dimensional polynomial space. By combining the input features into polynomial terms, it allows the support vector classifier to model curved decision boundaries. The polynomial kernel is written mathematically as:
K(x, x') = (x · x' + r)^d
where r is a constant term and d is the degree of the polynomial. A degree of 2, for example, introduces squared terms and interactions between features.
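As a quick numerical check (a sketch assuming NumPy and scikit-learn; the two sample points are made up), the kernel value can be computed directly from the formula and compared against sklearn.metrics.pairwise.polynomial_kernel:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([2.0, 3.0])
x_prime = np.array([1.0, 4.0])
r, d = 1.0, 2

# K(x, x') = (x · x' + r)^d, computed by hand:
manual = (x @ x_prime + r) ** d                      # (2*1 + 3*4 + 1)^2 = 225.0

# The same value via scikit-learn (gamma=1.0 so the formulas match exactly):
library = polynomial_kernel(x.reshape(1, -1), x_prime.reshape(1, -1),
                            degree=d, gamma=1.0, coef0=r)[0, 0]

print(manual, library)   # 225.0 225.0
```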
The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, effectively maps data into an infinite-dimensional space; it can create highly flexible, circular or blob-like decision boundaries. The RBF kernel is given by:
K(x, x') = exp( -gamma * ||x - x'||^2 )
where gamma controls how far the influence of each data point reaches: a larger gamma makes the influence more local, producing a more tightly curved boundary. The RBF kernel is powerful when the decision boundary is very complex and wavy.
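A small sketch (assuming NumPy and scikit-learn; the sample points are made up) shows how increasing gamma shrinks the similarity between the same pair of points:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([2.0, 3.0])
x_prime = np.array([1.0, 4.0])

for gamma in (0.1, 1.0, 10.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2), computed by hand and via sklearn.
    manual = np.exp(-gamma * np.sum((x - x_prime) ** 2))
    library = rbf_kernel(x.reshape(1, -1), x_prime.reshape(1, -1), gamma=gamma)[0, 0]
    print(f"gamma={gamma}: manual={manual:.6f}, sklearn={library:.6f}")
```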
Suppose you have a 2D point:
(x₁, x₂)
Using a polynomial kernel with parameters r = 1 and degree d = 2, the kernel function is:
K(x, x') = (x · x' + 1)^2
The feature mapping becomes:
x₁²
x₂²
√2 × x₁ × x₂
√2 × x₁
√2 × x₂
1 (bias term)
So if your original point is (2, 3):
x₁² = 4
x₂² = 9
√2 × 2 × 3 ≈ 8.485
√2 × 2 ≈ 2.828
√2 × 3 ≈ 4.243
1
Thus, the mapped 6D point is approximately:
(4, 9, 8.485, 2.828, 4.243, 1)
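A short sketch (assuming NumPy; the second point z is made up purely for comparison) confirms that taking the dot product of these mapped features reproduces the kernel value computed directly in the original 2D space:

```python
import numpy as np

def phi(p):
    # Explicit degree-2 polynomial feature map for a 2D point (r = 1).
    x1, x2 = p
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

x = np.array([2.0, 3.0])
z = np.array([1.0, 4.0])   # hypothetical second point for comparison

print(phi(x))                          # ~ [4, 9, 8.485, 2.828, 4.243, 1]
print(np.dot(phi(x), phi(z)))          # explicit mapping, then dot product: 225.0
print((np.dot(x, z) + 1) ** 2)         # kernel trick, no mapping needed: 225.0
```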
This higher-dimensional mapping enables linear separation in cases where the original data were not linearly separable.