SVM

Study notes

Support Vector Machine part 1: The idea and the math

Support Vector Machine part 2: Kernel function

----------------------------------------------------------

More explanation for part 1.

The reason the objective is converted from minimizing ||w|| to minimizing 1/2 ||w||^2 is to make it a quadratic programming problem.

With the square, the objective is convex (it curves up like a U shape), so the minimum value sits at the single bottom point.
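For reference, the underlying optimization problem is a quadratic objective with linear constraints (the standard hard-margin formulation):

    minimize over w, b:   1/2 ||w||^2
    subject to:           y_i (w.x_i + b) >= 1   for every training sample (x_i, y_i)

Since the margin width is 2/||w||, minimizing 1/2 ||w||^2 is the same as maximizing the margin.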


With the Lagrange trick,

the max of L(w, b, a) over a equals 1/2 ||w||^2 when the linear constraints are satisfied

the max of L(w, b, a) over a goes to infinity when the linear constraints are not satisfied

So, minimizing over all possible w and b the max over a of L(w, b, a) recovers the original problem: the minimum can only be attained where the linear constraints are satisfied.

The min of the max equals the max of the min when the KKT conditions are met, which is the case for the SVM objective function.
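Written out, the Lagrangian for the hard-margin problem is

    L(w, b, a) = 1/2 ||w||^2 - sum_i a_i * [ y_i (w.x_i + b) - 1 ],   with a_i >= 0

If some constraint y_i (w.x_i + b) >= 1 is violated, the corresponding bracket is negative and a_i can be pushed to infinity, so the max over a blows up; if all constraints hold, the best the max can do is make every a_i * [...] term zero, leaving exactly 1/2 ||w||^2.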


So solving the SVM problem, i.e. L(w, b, a), means first finding the min (over w and b), and then finding the max (over a).

Finding the min is easy: take the derivatives of L with respect to w and b. The function reaches its minimum (the bottom of the U shape) where the derivatives are 0. After substituting those conditions back in, we maximize the resulting function with respect to a, which is also easy. In the end we obtain the solutions w* and b*.
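Concretely, setting the derivatives to zero gives the standard conditions

    dL/dw = 0   ->   w = sum_i a_i y_i x_i
    dL/db = 0   ->   sum_i a_i y_i = 0

Substituting these back turns L into a function of a alone, and the max step becomes the dual problem: maximize sum_i a_i - 1/2 sum_i sum_j a_i a_j y_i y_j (x_i . x_j), subject to a_i >= 0 and sum_i a_i y_i = 0. The training samples enter only through the dot products x_i . x_j, which is what the kernel trick in part 2 exploits.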


The w* and b* define the separating hyperplane in the middle of the margin

    f = w*.x + b*

The hyperplanes w*.x + b* = 1 and w*.x + b* = -1 are the two margin hyperplanes (the edges of the margin)


Support vectors are the training records whose Lagrange multiplier a is != 0 (see the original notes for the reason)
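Below is a minimal sketch (assuming scikit-learn; the toy data points are made up) showing how w*, b*, and the support vectors (the records with a != 0) can be read off a fitted linear SVM:

    import numpy as np
    from sklearn.svm import SVC

    # toy, linearly separable 2-D data (made up for illustration)
    X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
    clf.fit(X, y)

    w_star = clf.coef_[0]               # w*
    b_star = clf.intercept_[0]          # b*
    print("w* =", w_star, "b* =", b_star)

    # support vectors: the training records whose multiplier a_i != 0
    print("support vectors:\n", clf.support_vectors_)
    print("a_i * y_i:", clf.dual_coef_[0])

    # decision function f(x) = w*.x + b*; its sign is the predicted class
    print("f(X) =", clf.decision_function(X))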

The above applies to data that are linearly separable.


For data that are not linearly separable, a soft-margin approach is needed, which allows some data points to be misclassified, i.e. to fall on the wrong side of the hyperplane.

To do this, an extra 'slack variable' is introduced for each sample (something like a penalty term), so the objective function also needs to include the sum of the slack variables. The weight of that sum is controlled by a parameter C.

If C is small, more misclassification is allowed, giving a softer margin.

If C is large, misclassification is tolerated much less, approaching a hard margin.
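A minimal sketch (assuming scikit-learn; the overlapping clusters are made up) of how C moves between soft and hard margin:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    # two overlapping clusters, so the classes are not perfectly separable
    X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 2.0])
    y = np.array([-1] * 50 + [1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print("C =", C,
              "| #support vectors =", int(clf.n_support_.sum()),
              "| training accuracy =", round(clf.score(X, y), 2))

A small C typically keeps more support vectors and tolerates more training errors; a large C tries harder to classify every training point correctly.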

--------------------------------------------------------

Gaussian Kernel / Radial Basis Function (RBF) Kernel

K(x1, x2) = exp(−γ ||x1 − x2||^2)

Here x1 is a training sample and x2 is a testing sample. We can replace the dot product of x1 and x2 with the kernel function K(x1, x2), because K represents a dot product in an infinite-dimensional feature space.
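A minimal sketch (assuming NumPy and scikit-learn; the two sample vectors are made up) checking that the hand-computed kernel value matches the library's rbf_kernel:

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    x1 = np.array([[1.0, 2.0, 3.0]])   # e.g. a training sample
    x2 = np.array([[2.0, 1.0, 0.5]])   # e.g. a testing sample
    gamma = 0.5

    manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
    library = rbf_kernel(x1, x2, gamma=gamma)[0, 0]
    print(manual, library)             # the two values agree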

Why does it represent an infinite-dimensional space?

The exp() function can be represented by a power series with infinitely many terms, according to the Taylor expansion, i.e. exp(x) = 1 + x + x^2/2! + x^3/3! + ... + x^n/n! + ...

Therefore, each feature is logically expanded into infinitely many features x, x^2, x^3, ... But you never need to compute x, x^2, x^3, ... explicitly to get the similarity (the dot product in that expanded space) between two samples; evaluating exp(...) gives that value directly.
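As a sketch in the 1-D case (taking gamma = 1/2 so the algebra stays short):

    K(x1, x2) = exp(-(x1 - x2)^2 / 2)
              = exp(-x1^2/2) * exp(-x2^2/2) * exp(x1*x2)
              = exp(-x1^2/2) * exp(-x2^2/2) * sum over n of (x1*x2)^n / n!

Each term (x1*x2)^n / n! is the product of the mapped features x1^n / sqrt(n!) and x2^n / sqrt(n!), so the kernel is a dot product in a space with one coordinate per power n, i.e. infinitely many dimensions (up to the exp(-x^2/2) scaling factors).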

The gamma (γ) scales the squared distance ||x1 − x2||^2: gamma = 1/(2σ^2), where σ plays the role of the standard deviation (the width) of the Gaussian.

So when manually setting gamma for the kernel function, a big gamma is equivalent to a small standard deviation, so only samples that are very close to each other produce a non-negligible kernel (similarity) value.

Another interpretation is that a training sample (support vector) influences only samples close to it. The consequence is that the dividing hyperplane tends to overfit the training data by hugging the training samples too closely.

If gamma is small, the standard deviation is big. A support vector then influences a bigger area (including samples further away). The consequence is that the dividing hyperplane tends to be smoother, but it may lose the ability to capture a legitimately complex decision boundary.

Therefore, there is a trade-off between too big and too small. A common default is gamma = 1/#OfFeatures. A common approach is to conduct a range search in a loop, e.g. 0.01 to 10 in steps of 0.01, to determine the best-performing gamma.

The SVM parameter C is normally combined with gamma in a grid search, e.g.

gamma: 0.01 to 10 in steps of 0.01, and C: 0.1 to 100 in steps of 1.
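A minimal sketch (assuming scikit-learn; the synthetic dataset and the coarser step sizes are just illustrative) of such a grid search:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    param_grid = {
        "gamma": np.arange(0.01, 10.01, 0.5),   # coarser steps than 0.01 to keep the demo quick
        "C": np.arange(0.1, 100.1, 10.0),
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print("best params:", search.best_params_)
    print("best CV accuracy:", round(search.best_score_, 3))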