Generalized linear models

Generalized linear models - study notes

Linear model y = w0 + w1x1 + w2x2 + ... + wnxn

The goal is to find the weights (w0, w1, ..., wn) that best approximate the given training data.

Then you can use the linear model to predict the y value for future inputs (x1, x2, ..., xn).

The key is to define an objective function (or loss function) that measures the gap between the model and the training data, and then find the (w0, w1, ..., wn) that minimizes that gap.

A straightforward measure is the distance from each training data point to the fitted curve/plane. The sum of these distances over all training data points then measures how well the model fits the training data. This leads to the first model.

1. Ordinary Least Squares

   This is the simplest linear regression model. The Loss is defined as the residual sum of squares (RSS) between the observed responses and the predicted responses.

   

   Loss = sum((y' - y)^2), where y' is the predicted value and the sum runs over all training samples.

   

   The objective function is then to minimize the Loss.

   

   from sklearn import linear_model

   reg = linear_model.LinearRegression()
   reg.fit(X, y)
   reg.coef_        # the best (w1, ..., wn) that minimizes the loss
   reg.intercept_   # w0
   reg.predict(Z)   # predict the y values for a new set of data Z
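
   As a sanity check, here is a minimal sketch (on made-up synthetic data) showing that the coefficients found by LinearRegression match the closed-form least-squares solution of the normal equations, w = (X^T X)^-1 X^T y:

   import numpy as np
   from sklearn.linear_model import LinearRegression

   rng = np.random.RandomState(0)
   X = rng.normal(size=(100, 2))
   y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, 100)

   reg = LinearRegression().fit(X, y)

   # normal equations: append a column of ones for the intercept w0,
   # then solve (Xb^T Xb) w = Xb^T y
   Xb = np.c_[np.ones(len(X)), X]
   w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

   print(w)                          # [w0, w1, w2]
   print(reg.intercept_, reg.coef_)  # should match the line above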

   

   This model is simple but has its drawbacks. When some of the variables x1, x2, x3, etc. are correlated with each other (i.e. multicollinearity), the model becomes very sensitive to outliers. A few random outliers can distort the model completely, because every data point contributes to the loss function. The following section has more details.

Multicollinearity (also called collinearity)

   Assume a linear function y = w0 + w1A + w2B

   If A and B are independent variables, the training data points scatter across the whole space, because when A is small B can be either small or big, and vice versa. The left figure below shows how the data points can be distributed. In this case linear regression is simple: the training will find the best w1 and w2 for A and B. Even if you put in a third irrelevant variable C (also independent), the training will give C a near-zero weight. Adding more noise to the training data won't change w1 and w2 much either, so the model is quite stable.

   

   If A and B are highly correlated, the problem kicks in. First, look at the distribution of the data points (right figure below): they mostly cluster in the middle ellipse area, because when A is small B is also small, and when A is big B is also big. That pretty much means you can represent B by A. Assume the true function is y = 10 + A. If you train the model with both A and B (with B 100% dependent on A), you can end up with y = 10 + 2A - B or y = 10 - 0.5A + 1.5B. The weights of A and B can change dramatically due to noise in the training data. The model is unstable, and it's hard to interpret the impact (weight) of A and B because even the sign can flip. Note that y = 10 + 2A - B and y = 10 - 0.5A + 1.5B are two completely different planes that intersect at the line y = 10 + A.

Note that collinearity may cause problems during model training, but in theory it doesn't hurt prediction power; it mainly hurts the interpretation of the variables, because the weights change so easily.
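
   A tiny numeric check of that point (assuming B is an exact copy of A): the three formulas from the example above produce identical predictions, even though their weights look completely different.

   import numpy as np

   A = np.array([0.0, 1.0, 2.0, 3.0])
   B = A.copy()                      # perfectly collinear with A

   y1 = 10 + A                       # the "true" function  y = 10 + A
   y2 = 10 + 2*A - B                 # y = 10 + 2A - B
   y3 = 10 - 0.5*A + 1.5*B           # y = 10 - 0.5A + 1.5B

   print(np.allclose(y1, y2), np.allclose(y1, y3))   # True True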

  

   The same issue appears when the model is used for a classification problem.

   Again, let's look at a model with two variables A and B. If A and B are independent of each other, the class 1 points (red rectangles) and class 0 points (green circles) scatter across the space, because a bigger/smaller A doesn't imply a bigger/smaller B. Therefore the data points spread over a wide area (see the left of the figure below). Finding the best model is equivalent to finding a plane (the bold blue line) separating the rectangles and circles. The blue line is w1A + w2B = c; the area above it is w1A + w2B > c and the area below is w1A + w2B < c.

   Note that if the separating plane leans or moves a bit due to noise, it doesn't matter much, because the rectangles and circles are already well separated.

   

   However, if A correlates with B, meaning that when A grows/drops B grows/drops too, the data points (both red rectangles and green circles) gather around a line somewhere in the middle (see the right of the figure below). In this case the separating plane (the bold blue line) is very sensitive to adjustments. An outlier (e.g. the green circle x) can affect the loss function value and pull the separating line downwards. And because the data points are so close to each other, a slight move of the separating plane can change the model dramatically (it crosses a lot more data points). Therefore a model built on correlated variables tends to be fragile and inaccurate.

   One way to remedy the multicollinearity problem is to increase the size of the training data; when good data dominates the noise, the situation improves.
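
   A rough sketch (made-up synthetic data and parameters) of the instability described above: refit a logistic-regression classifier on many noisy resamples and compare how much the learned weights of A and B vary when the two features are independent versus highly correlated.

   import numpy as np
   from sklearn.linear_model import LogisticRegression

   rng = np.random.RandomState(0)
   n, trials = 200, 50
   spread = {}
   for correlated in (False, True):
       coefs = []
       for _ in range(trials):
           A = rng.normal(0, 1, n)
           # B is either independent of A or almost an exact copy of it
           B = A + rng.normal(0, 0.15, n) if correlated else rng.normal(0, 1, n)
           # class labels are decided by A + B, plus some label noise
           y = (A + B + rng.normal(0, 0.5, n) > 0).astype(int)
           coefs.append(LogisticRegression().fit(np.c_[A, B], y).coef_[0])
       spread[correlated] = np.std(coefs, axis=0)

   print('std of (wA, wB), independent:', spread[False])
   print('std of (wA, wB), correlated: ', spread[True])

   The spread of the weights in the correlated case is typically much larger, which is exactly the fragility discussed above.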

    

   Correlated variables are very common, and a good way to handle them is regularization (adding a penalty term to the loss). The Ridge Regression below is a type of regression with regularization.

   

2. Ridge Regression

   This model still uses RSS and adds an extra penalty term.

   

   Loss = sum((y' - y)^2) + alpha * sum(w^2)

   

   The sum(w^2) is the sum of the squares of the coefficients (w1, w2, ..., wn); the intercept w0 is normally left out of the penalty. It's also called L2 regularization because it uses the squares of the coefficients.

   alpha is a tuning parameter that controls how much weight the penalty term gets.
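
   A small sketch (made-up data, and fit_intercept=False to keep the algebra simple) comparing sklearn's Ridge with the closed-form solution w = (X^T X + alpha*I)^-1 X^T y of the penalized loss above:

   import numpy as np
   from sklearn.linear_model import Ridge

   rng = np.random.RandomState(0)
   X = rng.normal(size=(100, 3))
   y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

   alpha = 0.5
   ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
   w = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

   print(ridge.coef_)  # the two lines should match (up to numerical precision)
   print(w)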

Why does the penalty term help?

   With the penalty in place, minimizing the loss function also pushes towards smaller coefficients (w1, w2, ..., wn). This has at least two benefits.

   No. 1: in the multicollinearity example y = 10 + A above, Ridge regression will tend to give the same weight to A and B, preferring y = 10 + 0.5A + 0.5B, because that gives a smaller sum of squared weights even when there is noise (see the quick numeric check after these two points). So the model is more stable, and one can say that A and B each contribute 0.5 positively to y. For a classification problem, the separating plane likewise becomes more stable and less sensitive to outliers, because the Ridge penalty term helps stabilise the weights of A and B.

   No. 2: it tends to choose a simpler model and avoid overfitting in polynomial regression (models with 2nd-order or higher terms of A and B). Bigger coefficients let the fitted curve or separating surface swing up and down over a wider range, making it more likely to overfit, i.e. to trace the given training points almost perfectly while generalizing poorly. In contrast, smaller coefficients make the separating plane (or fitted curve) smoother and less likely to overfit. Note that this benefit matters more for polynomial curve fitting than for a plain linear model; a straight line is hard to overfit with, but it can still be distorted by outliers.
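
   A quick numeric check for benefit No. 1 (assuming B is an exact copy of A): all three weight pairs below give identical predictions for y = 10 + A, but the L2 penalty is smallest for the balanced pair (0.5, 0.5), so that is the solution Ridge prefers.

   # (wA, wB) pairs that all predict y = 10 + A when B == A
   for wA, wB in [(1.0, 0.0), (2.0, -1.0), (0.5, 0.5)]:
       print((wA, wB), 'L2 penalty =', wA**2 + wB**2)
   # (1.0, 0.0)  L2 penalty = 1.0
   # (2.0, -1.0) L2 penalty = 5.0
   # (0.5, 0.5)  L2 penalty = 0.5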

   

   from sklearn import linear_model

   reg = linear_model.Ridge(alpha=0.5)  # normally you would try several alphas to find the best one
   reg.fit(X, y)
   reg.coef_
   ...

   

   To find the best alpha, run the cross-validation version of Ridge regression.

   reg = linear_model.RidgeCV(alphas=[0.01, 0.1, 1, 5, 10], cv=5)  # a list of alphas to test, and the number of CV folds
   reg.fit(X, y)
   reg.alpha_  # the best alpha from the given list

3. Lasso

   The full name is Least Absolute Shrinkage and Selection Operator.

   The main difference from Ridge regression is that it uses sum(|w|) instead of sum(w^2) as the penalty term. It's also called L1 regularization, as it uses the sum of the absolute values of the coefficients.

   

   Loss = 1/(2n) * sum((y' - y)^2) + alpha * sum(|w|)

   

   Because the penalty grows linearly with the weights rather than quadratically, it punishes large coefficients less harshly, so relatively bigger coefficients are allowed.

   The 1/(2n) factor (n is the number of training samples) is just a convention that simplifies the math. The Lasso / L1 regularization has a very nice property: it assigns exactly zero weights to unimportant variables, achieving feature selection.

   

L1 achieves feature selection:

   Consider a 2-dimensional space. The L1 penalty sum(|w|) = |w1| + |w2| (corresponding to variables A and B respectively) has the same value along a diamond shape (i.e. the Manhattan distance). See the blue diamond on the left side of the figure below. As the L1 penalty value increases (bigger coefficients), the diamond expands. The orange ellipses represent levels of the RSS around the unregularized optimum w'; the (w1, w2) combinations on one ellipse have the same RSS. The minimum of the total loss (RSS + L1 penalty) is achieved where an ellipse first touches the diamond. It's like throwing a diamond at a wall: a pointy corner of the diamond (which lies on an axis) is most likely to hit the wall first, which means the optimal solution is most likely to be on an axis. The coefficient on the other axis is then zero, so the unimportant feature is dropped.

   So when there is multicollinearity between, say, A and B, only the dominant variable, say A, gets a non-zero weight; B is likely to be given a zero weight. However, if B is only partially dependent on A, giving B a zero weight loses the information that only B carries, so Lasso / dropping the correlated variable is not always a good idea.

   

   Similarly, the L2 penalty sum(w^2) = w1^2 + w2^2 has the same value along a circle (i.e. the squared Euclidean distance). See the blue circle on the right side of the figure below. The minimum of the loss (RSS + L2 penalty) is achieved where an ellipse first touches the circle. The optimal solution lies on the circle and is not necessarily on an axis, so there is no feature selection: an unimportant feature is likely to keep a very small, but non-zero, coefficient.
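
   Since the figures referred to above are not reproduced here, the following matplotlib sketch (with a made-up RSS surface, purely illustrative) draws the diamond-shaped L1 level sets, the circular L2 level sets, and some loss ellipses around a hypothetical unregularized optimum w':

   import numpy as np
   import matplotlib.pyplot as plt

   w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
   l1 = np.abs(w1) + np.abs(w2)   # L1 penalty: diamond-shaped level sets
   l2 = w1**2 + w2**2             # L2 penalty: circular level sets
   # a made-up RSS surface centred on an unregularized optimum w' = (1.5, 0.8)
   rss = 3*(w1 - 1.5)**2 + (w2 - 0.8)**2 + 1.5*(w1 - 1.5)*(w2 - 0.8)

   fig, axes = plt.subplots(1, 2, figsize=(10, 5))
   for ax, pen, title in zip(axes, [l1, l2], ['L1 penalty (diamond)', 'L2 penalty (circle)']):
       ax.contour(w1, w2, pen, levels=[0.5, 1.0, 1.5], colors='blue')
       ax.contour(w1, w2, rss, levels=[0.5, 1.5, 3.0], colors='orange')
       ax.set_title(title)
       ax.set_xlabel('w1')
       ax.set_ylabel('w2')
       ax.set_aspect('equal')
   plt.show()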

   

   from sklearn import linear_model

   reg = linear_model.Lasso(alpha = 0.5)

   reg.fit(X, y)

   ...

   

   There is also a CV version:

   reg = linear_model.LassoCV(alphas=[0.01, 0.1, 1, 5, 10], cv = 5)

   reg.fit(X, y)

   ...

   

A comparison of the weights learned by plain Linear Regression, Lasso, and Ridge on correlated features (in the code below, X2 and X3 are noisy copies of X1, and Y depends only on X1):

Plain Linear Regression is the most unstable; its weights can change a lot under different noise.

Lasso tends to give X2 and X3 zero weights.

Ridge sometimes does very well, giving X1 a weight near 3 and near-zero weights to the other two; sometimes it is less accurate, but it is generally more stable.

from sklearn.linear_model import LinearRegression, Lasso, Ridge
import numpy as np

size = 5000
np.random.seed(seed=1000)

# X2 and X3 are noisy copies of X1, so the three features are highly correlated
X1 = np.random.normal(0, 1, size) + np.random.normal(0, .5, size)
X2 = X1 + np.random.normal(0, 1, size)
X3 = X1 + np.random.normal(0, 1, size)
# the target depends only on X1
Y = 3 * X1 + np.random.normal(0, .5, size)
X = np.array([X1, X2, X3]).T

lr = LinearRegression()
lr.fit(X, Y)
print('lr weights:{0}'.format(lr.coef_))

lr = Lasso()
lr.fit(X, Y)
print('Lasso weights:{0}'.format(lr.coef_))

lr = Ridge()
lr.fit(X, Y)
print('Ridge weights:{0}'.format(lr.coef_))

4. Elastic Net

   This is a combination of both L1 and L2 regularization. It's a trade-off between Lasso and Ridge that allows some level of feature selection as well as some of Ridge's stability.

   

   Loss = 1/(2n) * sum((y' - y)^2) + alpha * L1_ratio * sum(|w|) + 0.5 * alpha * (1 - L1_ratio) * sum(w^2)

  

   The L1_ratio is between 0 and 1 and determines how much of the penalty comes from the L1 term. In some other formulations the two penalties get separate coefficients, e.g. written as "alpha * L1 + lambda * L2".

   Elastic Net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while Elastic Net is likely to pick both.

   Depending on the L1_ratio, the shape of the penalty region varies between a diamond and a circle (see the illustration below); it combines properties of both.
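
   A quick sketch (made-up synthetic data) of one extreme of l1_ratio: with l1_ratio=1 the Elastic Net penalty reduces to pure L1, so its coefficients should essentially match Lasso's for the same alpha.

   import numpy as np
   from sklearn.linear_model import ElasticNet, Lasso

   rng = np.random.RandomState(0)
   X = rng.normal(size=(200, 3))
   y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(0, 0.1, 200)

   alpha = 0.1
   print(Lasso(alpha=alpha).fit(X, y).coef_)
   print(ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y).coef_)  # ~ same as the line above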

 

   from sklearn.linear_model import ElasticNet

   reg = ElasticNet(alpha = 1, l1_ratio = 0.5)

   reg.fit(X, y)

   ...

   

   There is also a CV version:

   from sklearn.linear_model import ElasticNetCV

   reg = ElasticNetCV(alphas=[0.01, 0.1, 1, 5, 10], l1_ratio=0.5, cv=5)
   reg.fit(X, y)

   ...