L1 & L2 Penalty

L1 and L2 penalties in linear regression / logistic regression. This is also called L1/L2 regularization.

Minimizing the loss function alone (normally the least squared error: 1/n * SUM(|y - f(x)|^2) ) tends to overfit the training data, which typically shows up as large coefficients.
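
As an illustration, here is a minimal sketch (assuming only numpy; the toy data and the degree-9 polynomial are made up for the example) of plain least squares overfitting a few noisy points, with the fitted coefficients blowing up:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 10)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

    # plain least squares on a degree-9 polynomial (Vandermonde design matrix)
    X = np.vander(x, 10)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    # the fit passes through every training point, but the coefficients are
    # typically far larger in magnitude than the data -- the overfitting symptom above
    print(np.abs(w).max())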

A general solution is to add a penalty term to the loss function. In a Bayesian view, this penalty term is equivalent to a prior probability over the coefficients (the L2 penalty corresponds to a Gaussian prior and the L1 penalty to a Laplace prior); check PRML for details.

L2 regularization:

   Loss function + c * SUM(|w|^2)     

Here w are the coefficients and c is the weight assigned to the penalty.

The regression solution is the minimizer of the loss function plus the penalty, so the penalty helps prevent w from getting too big (overfitting).
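
For concreteness, here is a minimal sketch (assuming scikit-learn; the toy data is made up) where Ridge's `alpha` plays the role of the weight c above and shrinks the coefficients relative to plain least squares:

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    y = X[:, 0] + 0.1 * rng.standard_normal(50)   # only the first feature matters

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)           # alpha is the penalty weight c

    # the L2 penalty shrinks the coefficient vector
    print(np.abs(ols.coef_).sum(), np.abs(ridge.coef_).sum())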

L1 regularization:

   Loss function + c * SUM(|w|)      

Pretty much the same here, except it uses the L1 norm of w.

SUM(|w|^2) is the squared Euclidean (L2) distance from the origin, so the penalty effectively constrains the solution to a sphere / hypersphere around the origin.

SUM(|w|) is the Manhattan (L1) distance from the origin, so the penalty constrains the solution to a square / diamond-shaped region (a cross-polytope) in high-dimensional space.

Within the spherical constraint, the optimal solution (minimum loss) can land anywhere on the sphere's surface, so the coefficients are shrunk but rarely exactly zero.

Within the diamond-shaped constraint, the optimal solution is "normally" at a corner of the diamond, which always lies on an axis (one of the w). It is like throwing a diamond at a wall: a corner almost always hits the wall first. This is a very nice property. When the optimal solution sits on an axis, the other axes have zero values, which means those coordinates (features) are not contributing at all. Then we can safely remove them, i.e. remove the irrelevant features. It is common practice to use the L1 penalty in linear regression to detect irrelevant features.
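
A minimal sketch of that behaviour (assuming scikit-learn; the toy data is made up), where the L1-penalized Lasso zeroes out the coefficients of the irrelevant features:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)   # only 2 of 20 features are relevant

    lasso = Lasso(alpha=0.1).fit(X, y)

    # indices of the surviving (non-zero) coefficients -- typically just 0 and 1
    print(np.flatnonzero(lasso.coef_))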

L2, however, shrinks coefficients smoothly instead of zeroing them out, so it tends to give more precise coefficient estimates when most features matter. It also has an analytical (closed-form) solution, which often makes the optimum easier to compute.
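
The analytical solution referred to here is the closed form w = (X^T X + c*I)^(-1) X^T y; a minimal numpy sketch (ignoring the intercept for simplicity):

    import numpy as np

    def ridge_closed_form(X, y, c):
        # solve (X^T X + c*I) w = X^T y directly instead of iterating
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + c * np.eye(n_features), X.T @ y)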

L1 does not have an analytical solution, so it is sometimes more costly to fit, but if the solution is sparse (only a small percentage of features are relevant), sparsity properties can be exploited to accelerate the computation.

Finally, one may use L1 to prune features, and then use L2 on the remaining relevant features for a better regression.
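
One possible way to wire this up (a sketch assuming scikit-learn; the alpha values are placeholders, not tuned):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso, Ridge

    model = make_pipeline(
        SelectFromModel(Lasso(alpha=0.1)),   # stage 1: L1 pruning of irrelevant features
        Ridge(alpha=1.0),                    # stage 2: L2 regression on the surviving features
    )
    # model.fit(X, y) then runs both stages on the training data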

PS. Another way is to combine L1 and L2 (i.e. P * L1 + (1 - P) * L2); this is called the Elastic Net.
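
In scikit-learn this corresponds to ElasticNet, where `l1_ratio` plays the role of P above (up to scaling conventions) and `alpha` is the overall penalty weight (a sketch with made-up values):

    from sklearn.linear_model import ElasticNet

    enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # 50/50 mix of the L1 and L2 penalties
    # enet.fit(X, y) behaves like Lasso as l1_ratio -> 1 and like Ridge as l1_ratio -> 0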