Ridge regression is a shrinkage method: it adds a penalty on the size of the estimated coefficients, minimizing RSS + lambda * sum(beta_j^2), which (with no intercept) yields the closed-form estimate beta.ridge = (X'X + lambda*I)^(-1) X'y.
Ridge regression is typically useful for high-dimensional data and for correlated predictors.
High-dimensional here means that, with p predictors and n observations, either p > n, or p < n but p is not much smaller than n (not p << n).
In these situations ridge regression can outperform OLS, because the OLS coefficient estimates have high variance and therefore predict poorly on test data.
To build a deeper understanding of ridge regression, it is instructive to look at the shrinkage paths of the estimated ridge coefficients (we call them betas in the following exercise).
A shrinkage path is the trajectory of an estimated coefficient as lambda varies.
In practice, we select the best lambda by cross-validation; 10-fold cross-validation usually works well.
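A minimal sketch of that selection step, assuming the glmnet package (which is not used elsewhere in this exercise); alpha = 0 gives the ridge penalty and nfolds = 10 gives 10-fold cross-validation:
# Hedged sketch: 10-fold CV for lambda with glmnet (not part of the exercise code).
library(glmnet)
set.seed(777)
x <- matrix(rnorm(100 * 3), nrow = 100, ncol = 3)
y <- drop(x %*% c(0.5, 0.5, -0.5) + rnorm(100, sd = sqrt(10)))
cv <- cv.glmnet(x, y, alpha = 0, nfolds = 10)   # alpha = 0 -> ridge penalty
cv$lambda.min                                   # lambda with the smallest CV error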
We use the following data-generating process as a baseline setting:
the true beta vector is (0.5, 0.5, -0.5), the errors are N(0, 10), n = 100, and
the variance-covariance matrix sigma has 2 on the diagonal and 0.1 off the diagonal (the first sigma in the script below).
Standardization here means making the variance of each predictor equal to 1 by dividing each predictor by its standard deviation.
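A small standalone sketch of this step (the matrix size is just for illustration); scale() with an explicit scale vector does the same thing as the manual division used in the script below:
# Hedged sketch: divide each column by its standard deviation (no centering).
x <- matrix(rnorm(300), nrow = 100, ncol = 3)
x.sd <- scale(x, center = FALSE, scale = apply(x, 2, sd))
apply(x.sd, 2, sd)   # every column now has standard deviation (and variance) 1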
We first apply ridge to the baseline setting, once on the standardized data and once on the original data.
With the original (higher-variance) data the shrinkage paths are less steep, which implies that the variance of a predictor influences how strongly ridge shrinks its coefficient.
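To see why, consider a single predictor with no intercept: the ridge estimate is x'y / (x'x + lambda) = beta.ols * x'x / (x'x + lambda), so a larger x'x (a higher-variance predictor) means relatively less shrinkage at the same lambda. A standalone toy check (assumed example, not part of the exercise code):
# Hedged sketch: relative shrinkage factor x'x / (x'x + lambda) for one predictor.
set.seed(1)
x.low  <- rnorm(100, sd = 1)   # low-variance predictor
x.high <- rnorm(100, sd = 3)   # high-variance predictor
y.low  <- 0.5 * x.low  + rnorm(100)
y.high <- 0.5 * x.high + rnorm(100)
shrink <- function(x, y, lambda){
  b.ols   <- sum(x * y) / sum(x^2)
  b.ridge <- sum(x * y) / (sum(x^2) + lambda)
  c(ols = b.ols, ridge = b.ridge, factor = b.ridge / b.ols)
}
shrink(x.low,  y.low,  lambda = 50)   # stronger relative shrinkage
shrink(x.high, y.high, lambda = 50)   # milder relative shrinkage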
Next we use a sigma matrix in which the predictors have different variances (2, 4, and 5; the second sigma in the script below).
The paths for the standardized data stay essentially unchanged, while on the original data the high-variance predictor x2 behaves differently: although beta2 is smaller than beta1, beta1 converges to 0 faster than beta2 as lambda grows.
This means that if we want to eliminate the influence of the predictors' units (for example, 1 dollar vs. 1000 dollars, cm vs. m, g vs. kg), we need to standardize the predictors, as the following sketch illustrates.
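A standalone toy example (assumed, not part of the exercise code) of the unit effect: rescaling a predictor from cm to m changes the unstandardized ridge estimate, but after standardization the estimate is identical either way.
# Hedged sketch: ridge is not unit-invariant unless the predictor is standardized.
set.seed(2)
x.cm <- rnorm(100, sd = 10)          # predictor measured in cm
x.m  <- x.cm / 100                   # the same predictor in m
y    <- 0.05 * x.cm + rnorm(100)
ridge1 <- function(x, y, lambda) sum(x * y) / (sum(x^2) + lambda)
ridge1(x.cm, y, lambda = 10)             # estimate per cm
ridge1(x.m,  y, lambda = 10) / 100       # per-m estimate converted back to per cm: very different
ridge1(x.cm / sd(x.cm), y, lambda = 10)  # standardized predictor ...
ridge1(x.m  / sd(x.m),  y, lambda = 10)  # ... gives the identical estimate for both units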
For highly correlated predictors, we use a sigma matrix with 2 on the diagonal and covariance 1.9 between x1 and x2 (the third sigma in the script below), so the correlation coefficient between x1 and x2 is 1.9/2 = 0.95. The resulting paths are as follows:
We find that the paths of beta1 and beta2 quickly converge to each other and then converge to 0 together.
This shows that ridge treats highly correlated predictors as nearly the same predictor and shrinks them by almost the same amount, which implies that ridge regression cannot be used for predictor selection; the sketch below illustrates this.
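A standalone toy example (assumed, not part of the exercise code): when one predictor is nearly a copy of the other, ridge splits the coefficient between them instead of setting one of them to 0.
# Hedged sketch: ridge spreads the signal across highly correlated predictors.
set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)     # x2 is almost a copy of x1
y  <- x1 + rnorm(100)                # only x1 truly matters
X  <- cbind(x1, x2)
solve(t(X) %*% X + 5 * diag(2)) %*% t(X) %*% y   # two similar coefficients, neither near 0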
With a sigma setting that makes all three predictors highly correlated (the last sigma in the script below),
all three betas first converge to each other and then converge to 0 together, which leads to the same conclusion as above.
In addition, when lambda = 0 the estimated ridge betas are equal to the OLS betas. And because the standardized predictors share the same scale, the coefficient that is smaller in the OLS fit converges to 0 more quickly as lambda increases.
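As a quick check of the lambda = 0 case (this reuses x.sd, data, and ridge() from the script below, so run it after the script):
# Hedged check: ridge with lambda = 0 reproduces OLS on the same design matrix.
beta.ols <- solve(t(x.sd) %*% x.sd) %*% t(x.sd) %*% data$y
all.equal(c(beta.ols), c(ridge(x.sd, data$y, lambda = 0)))   # TRUE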
# test the shrinkage path of ridge
rm(list = ls())
library(MASS)
set.seed(777)
n <- 100
beta.true <- c(0.5, 0.5, -0.5)
# Baseline setting: equal variances (2) and weak correlation (0.1).
sigma <- matrix(c(2,   0.1, 0.1,
                  0.1, 2,   0.1,
                  0.1, 0.1, 2),
                nrow = 3, ncol = 3, byrow = TRUE)
# To reproduce the other settings, uncomment exactly one of the following.
# Different variances for each predictor:
# sigma <- matrix(c(2,   0.1, 0.1,
#                   0.1, 4,   0.1,
#                   0.1, 0.1, 5),
#                 nrow = 3, ncol = 3, byrow = TRUE)
# x1 and x2 highly correlated:
# sigma <- matrix(c(2,   1.9, 0.1,
#                   1.9, 2,   0.1,
#                   0.1, 0.1, 2),
#                 nrow = 3, ncol = 3, byrow = TRUE)
# x1 and x3 highly correlated:
# sigma <- matrix(c(2,   0.1, 1.9,
#                   0.1, 2,   0.1,
#                   1.9, 0.1, 2),
#                 nrow = 3, ncol = 3, byrow = TRUE)
# All three predictors highly correlated:
# sigma <- matrix(c(2,   1.9, 1.9,
#                   1.9, 2,   1.9,
#                   1.9, 1.9, 2),
#                 nrow = 3, ncol = 3, byrow = TRUE)
mu <- rep(0,3)
# Generate data: draw x from a multivariate normal, draw y from the linear model,
# and also return a column-wise standardized (variance 1) copy of x.
data.generator <- function(n, sigma, mu, beta){
  x <- mvrnorm(n, mu, sigma)
  e <- rnorm(n, 0, sqrt(10))
  y <- x %*% beta + e
  x.1 <- x[, 1] / sd(x[, 1])
  x.2 <- x[, 2] / sd(x[, 2])
  x.3 <- x[, 3] / sd(x[, 3])
  x.sd <- cbind(x.1, x.2, x.3)
  data <- data.frame("y" = y, "x.sd" = x.sd, "x" = x)
  return(data)
}
data <- data.generator(n, sigma, mu, beta.true)
x.sd <- cbind(data$x.sd.x.1, data$x.sd.x.2, data$x.sd.x.3)  # standardized predictors
x <- cbind(data$x.1, data$x.2, data$x.3)                    # original-scale predictors
# Closed-form ridge estimator (no intercept): (X'X + lambda * I)^(-1) X'y
ridge <- function(x, y, lambda, p = 3){
  beta.ridge <- solve(t(x) %*% x + diag(lambda, nrow = p, ncol = p)) %*% t(x) %*% y
  return(beta.ridge)
}
# Grid of lambda values from 10^3 down to 10^-3
grid <- 10^seq(3, -3, length = 100)
beta.ridge <- matrix(NA, nrow = 3, ncol = length(grid))
beta.ridge.nonsd <- matrix(NA, nrow = 3, ncol = length(grid))
# Shrinkage paths on the standardized predictors
for (i in 1:length(grid)){
  lam <- grid[i]
  beta.ridge[, i] <- ridge(x.sd, data$y, lam)
}
# Shrinkage paths on the original (non-standardized) predictors
for (i in 1:length(grid)){
  lam <- grid[i]
  beta.ridge.nonsd[, i] <- ridge(x, data$y, lam)
}
# Plot the shrinkage paths for the standardized data
ylimits <- c(-1.5, 1.5)
plot(x = grid, y = beta.ridge[1, ], col = "red", ylim = ylimits,
     xlab = expression(lambda), ylab = "",
     main = expression(hat(beta) ~ "for different" ~ lambda), type = "n")
grid()
points(x = grid, y = beta.ridge[1, ], col = "orange", lwd = 1)
points(x = grid, y = beta.ridge[2, ], col = "red", lwd = 1)
points(x = grid, y = beta.ridge[3, ], col = "darkblue", lwd = 1)
abline(h = 0, col = "black")
legend("topright", c(expression(beta[1]), expression(beta[2]), expression(beta[3])),
       col = c("orange", "red", "darkblue"), pch = 1)
mtext("standardized data")
# Plot the shrinkage paths for the non-standardized data
ylimits <- c(-1.5, 1.5)
plot(x = grid, y = beta.ridge.nonsd[1, ], col = "red", ylim = ylimits,
     xlab = expression(lambda), ylab = "",
     main = expression(hat(beta) ~ "for different" ~ lambda), type = "n")
grid()
points(x = grid, y = beta.ridge.nonsd[1, ], col = "orange", lwd = 1)
points(x = grid, y = beta.ridge.nonsd[2, ], col = "red", lwd = 1)
points(x = grid, y = beta.ridge.nonsd[3, ], col = "darkblue", lwd = 1)
abline(h = 0, col = "black")
legend("topright", c(expression(beta[1]), expression(beta[2]), expression(beta[3])),
       col = c("orange", "red", "darkblue"), pch = 1)
mtext("non-standardized data")