The support vector machine (SVM) is primarily used as a classifier, i.e., to split labeled data into different classes. However, it can be used for regression as well; in that case it is called support vector regression (SVR). From a geometrical point of view, the main idea is to choose a splitter (e.g., a line, curve, plane, or surface) that maximizes the distance between itself and the nearest data points. From this perspective, the most important objects to discuss are the splitter and the margins, i.e., the distances between the splitter and the nearest labeled points. So we have the following main objects to consider further:
Classifier: split two (or more) classes by a hyperplane
Margin: maximize the distance between the nearest data points and the hyperplane
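For a linear splitter, this geometric picture corresponds to a small optimization problem. As a compact reminder (using the standard notation w for the normal vector of the hyperplane and b for its offset, which are not introduced in the text above), the hard-margin formulation reads:

\[
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b) \ge 1 \quad\text{for all } i,
\]

where the labels are taken as \(y_i \in \{-1,+1\}\). The width of the resulting margin (the "two-way road" described below) is \(2/\lVert w\rVert\), so minimizing \(\lVert w\rVert\) is the same as maximizing the margin.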
Let us look at an example in the figures. Here we have a two-dimensional data set (i.e., there are two features) that falls into two classes, labeled by triangles and circles (or red and blue, respectively). As one can see, the classes can be split by several lines (figure on the left). Each of these lines defines a classifier, but only one of them splits the labeled data in such a way that the distance between the line and the nearest points is maximized. On the right-hand side, you can see the labeled data split by the blue line. The points nearest to the blue line lie on the orange lines, and the distance between these points and the blue line, i.e., the distance between the blue line and the orange lines, is the maximum among all possible choices. In layman's words, we first look for the widest two-way road that splits the labeled data without hitting any of them, and then pick the line in the middle of the road as the splitter. The labeled data that lie on the orange lines are called support vectors; in the picture, they are identified by their black outlines.
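A minimal sketch of this setup with scikit-learn (assuming it is available): fit a linear SVM on a small synthetic two-class data set and read off the support vectors, which are exactly the points lying on the margin lines. The data below is made up purely to mimic the figure.

```python
# Minimal sketch: linear SVM on a toy two-dimensional, two-class data set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated blobs: class 0 ("circles") and class 1 ("triangles").
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
clf.fit(X, y)

# The splitter is the line w.x + b = 0; the margin lines are w.x + b = +/-1.
w, b = clf.coef_[0], clf.intercept_[0]
margin_width = 2 / np.linalg.norm(w)

print("support vectors (points on the margin lines):")
print(clf.support_vectors_)
print("margin width:", margin_width)
```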
SVM has two interesting features that help it classify labeled data in more general cases, where the data cannot be nicely split by a linear splitter. These are the kernel method and the soft-border method: the kernel method is based on mapping the data to a new space, while a soft border allows for a more tolerant splitter.
Kernel trick: mapping to another space (usually from a lower-dimensional space to a higher-dimensional one)
Soft border: a soft border (or soft margin) lets the SVM ignore some outliers that would otherwise significantly degrade the classifier. This also acts as a form of regularization that helps overcome overfitting.
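In scikit-learn's SVC, these two ideas appear as two constructor arguments, roughly as sketched below (the particular values are illustrative, not recommendations):

```python
from sklearn.svm import SVC

# kernel: which implicit mapping to use ("linear", "poly", "rbf", ...)
# C: how soft the border is -- smaller C tolerates more misclassified points
clf = SVC(kernel="rbf", C=1.0)
```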
In many cases, one cannot separate two labels by a linear splitter (see the figure), but a nonlinear splitter can probably do the job. The way to make this possible is called the kernel trick: one maps the points to another space in which they can be split by a linear splitter.
Here is a visualized example. If one applies a mapping, one can bring part of the red circles to the corner, and then, in the second figure, split the classes by a linear splitter. Applying the reverse mapping then yields a nonlinear splitter in the original space.
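In symbols (standard notation, not introduced above): if \(\varphi\) denotes the mapping to the new space, the algorithm only ever needs inner products in that space, so it suffices to supply a kernel function

\[
k(x, x') = \langle \varphi(x), \varphi(x') \rangle ,
\]

and the mapping \(\varphi\) itself never has to be computed explicitly. This is what keeps the trick cheap even when the new space has very high (or infinite) dimension.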
One can also view the kernel trick as increasing the dimension of the feature set. This way, one can lift the different labels to different levels and split them by a linear splitter in a higher-dimensional space. For instance, here you can see red and blue labels. A circle can split the two, but a circle is not a linear splitter. The question is how one can find this circle. The next figure shows how, by lifting the labeled data into a third dimension, one can find a linear (two-dimensional) splitter that separates the dots.
To justify the circular splitter in the previous figure, one can increase the dimension. That is, the mapping in the kernel trick can also go to a higher-dimensional (or even infinite-dimensional) space, which is in fact the more popular and useful direction in most cases. As we have seen, there is no line or curve here that can easily split the red and blue dots. But if we map the data onto a paraboloid in three-dimensional space, there is a plane that splits the red and the blue data, since the red data end up higher than the blue ones. Mapping back to the two-dimensional space, the splitter is now a circle rather than a line.
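A small sketch of this lifting, again with scikit-learn and made-up circular data (an inner disk and an outer ring, standing in for the blue and red dots): the explicit map (x1, x2) -> (x1, x2, x1^2 + x2^2) puts the two groups at different heights on the paraboloid, so a plain linear SVM separates them in three dimensions, and the kernel trick achieves the same thing implicitly.

```python
# Sketch of the paraboloid lift: data that needs a circular splitter in 2D
# becomes linearly separable in 3D after adding the feature x1^2 + x2^2.
# The data below is synthetic and only meant to mimic the figure.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100),   # inner disk ("blue")
                        rng.uniform(2.0, 3.0, 100)])  # outer ring ("red")
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

# Explicit lift to 3D: the extra coordinate is the squared distance from the origin.
X_lifted = np.column_stack([X, X[:, 0] ** 2 + X[:, 1] ** 2])

# A linear splitter (a plane) works in the lifted space ...
linear_3d = SVC(kernel="linear").fit(X_lifted, y)
print("accuracy with explicit lift:", linear_3d.score(X_lifted, y))

# ... which is what the kernel trick does implicitly, e.g. with an RBF kernel.
rbf_2d = SVC(kernel="rbf").fit(X, y)
print("accuracy with RBF kernel   :", rbf_2d.score(X, y))
```

The plane found in the lifted space corresponds to a circle x1^2 + x2^2 = const back in the original plane, which is exactly the circular splitter in the figure.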
Sometimes the labeled data are so mixed that they cannot be easily split by any line, curve, etc. In that case, it might be a good idea to be more tolerant and allow some of the labeled data to lie within the area of the other label. We then accept a softer border, as opposed to a hard border. In the picture above, one can see a few red and blue data points located in the opposite areas.
This method allows some misclassification in return for obtaining a classifier at all. There is a parameter C that determines how much one wants to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin linear splitter if that splitter does a better job of classifying all the training points correctly. Conversely, a very small value of C will cause the model to look for a larger-margin splitter, even if that means more misclassifications. Therefore,
High C = low bias, high variance.
Low C = high bias, low variance.
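The contrast can be seen in a quick sketch (scikit-learn again, with made-up overlapping blobs; the two values of C are chosen only for contrast):

```python
# Sketch: how C trades margin width against training misclassifications.
# Overlapping synthetic blobs, so no perfect linear splitter exists.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=[0, 0], scale=1.2, size=(100, 2)),
               rng.normal(loc=[2, 2], scale=1.2, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"training accuracy = {clf.score(X, y):.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```

Typically the small-C run reports a wider margin, more support vectors, and a lower training accuracy, in line with the bias/variance rule above.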