GOAL: Learn linear regression and how to implement it using scikit-learn functions.
Learning experience: In this learning experience, I learned that linear regression is a method to find a line that can predict future data. We implemented this algorithm using scikit-learn, but ultimately found that ordinary linear regression does not fit real-world data well. Therefore, we implemented a random forest algorithm, which achieved better results than linear regression. This assignment deepened my understanding of the differences between various machine learning algorithms and taught me how to choose the appropriate model based on real-world scenarios.
Working environment:
OS: Windows 11 Home
CPU: Intel Core i9-13900K
GPU: NVIDIA RTX 4090
Python version: 3.12.2
Development environment: Jupyter Notebook
13.0 Introduction
In linear regression, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between a dependent variable (y) and an independent variable (x). [7]
Linear regression is one of the simplest supervised learning algorithms in our toolkit. Linear regression and its extensions continue to be a common and useful method of making predictions when the target vector is a quantitative value (e.g., home price, age). In this chapter we will cover a variety of linear regression methods (and some extensions) for creating well-performing prediction models.
Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving one or more independent variables that best predict the value of the dependent variable. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. There are simple linear regression calculators that use a “least squares” method to discover the best-fit line for a set of paired data. You can then estimate the value of the dependent variable from the independent variable.
Why is linear regression important?
Linear-regression models are relatively simple and provide an easy-to-interpret mathematical formula that can generate predictions. Linear regression can be applied to various areas in business and academic study.
You’ll find that linear regression is used in everything from biological, behavioral, environmental and social sciences to business. Linear-regression models have become a proven way to scientifically and reliably predict the future. Because linear regression is a long-established statistical procedure, the properties of linear-regression models are well understood and can be trained very quickly. [2]
Scikit-learn provides the module sklearn.linear_model, which implements a variety of linear models, such as: [3]
Classical linear regressors
linear_model.LinearRegression(*[, ...]) : Ordinary least squares Linear Regression.
linear_model.Ridge([alpha, fit_intercept, ...]) : Linear least squares with l2 regularization.
linear_model.RidgeCV([alphas, ...]) : Ridge regression with built-in cross-validation.
linear_model.SGDRegressor([loss, penalty, ...]) : Linear model fitted by minimizing a regularized empirical loss with SGD.
linear_model.ElasticNet([alpha, l1_ratio, ...]) : Linear regression with combined L1 and L2 priors as regularizer.
linear_model.ElasticNetCV(*[, l1_ratio, ...]) : Elastic Net model with iterative fitting along a regularization path.
linear_model.Lars(*[, fit_intercept, ...]) : Least Angle Regression model (a.k.a. LAR).
linear_model.LarsCV(*[, fit_intercept, ...]) : Cross-validated Least Angle Regression model.
linear_model.Lasso([alpha, fit_intercept, ...]) : Linear Model trained with L1 prior as regularizer (aka the Lasso).
linear_model.LassoCV(*[, eps, n_alphas, ...]) : Lasso linear model with iterative fitting along a regularization path.
linear_model.LassoLars([alpha, ...]) : Lasso model fit with Least Angle Regression (a.k.a. Lars).
linear_model.LassoLarsCV(*[, fit_intercept, ...]) : Cross-validated Lasso, using the LARS algorithm.
linear_model.LassoLarsIC([criterion, ...]) : Lasso model fit with Lars using BIC or AIC for model selection.
linear_model.OrthogonalMatchingPursuit(*[, ...]) : Orthogonal Matching Pursuit model (OMP).
linear_model.OrthogonalMatchingPursuitCV(*) : Cross-validated Orthogonal Matching Pursuit model (OMP).
There are more regression models, but in this chapter we will focus on LinearRegression, Ridge (13.4), and Lasso (13.5).
class sklearn.linear_model.LinearRegression(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False), is an ordinary least squares Linear Regression. [4]
There are some important parameters:
fit_intercept (bool, default=True) : Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
copy_X (bool, default=True) : If True, X will be copied; else, it may be overwritten.
n_jobs (int, default=None) : The number of jobs to use for the computation. This will only provide speedup in case of sufficiently large problems, that is if firstly n_targets > 1 and secondly X is sparse or if positive is set to True. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
positive (bool, default=False) : When set to True, forces the coefficients to be positive. This option is only supported for dense arrays.
And some commonly used methods (a brief usage sketch follows this list):
fit(X, y[, sample_weight]) : Fit linear model.
predict(X) : Predict using the linear model.
score(X, y[, sample_weight]) : Return the coefficient of determination of the prediction.
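To make the API concrete, here is a minimal sketch of how these methods fit together on a tiny, invented dataset (the numbers are purely illustrative, not from the assignment):

```python
# Minimal sketch of LinearRegression's fit / predict / score methods.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one feature, four observations
y = np.array([2.1, 3.9, 6.2, 8.1])           # roughly y = 2x

model = LinearRegression(fit_intercept=True)
model.fit(X, y)                               # fit(X, y)

print(model.intercept_, model.coef_)          # fitted parameters
print(model.predict([[5.0]]))                 # predict(X) on a new observation
print(model.score(X, y))                      # score(X, y) returns R^2
```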
class sklearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, copy_X=True, max_iter=None, tol=0.0001, solver='auto', positive=False, random_state=None), Linear least squares with l2 regularization. Minimizes the objective function: ||y - Xw||^2_2 + alpha * ||w||^2_2. [5]
There are some important parameters:
alpha ({float, ndarray of shape (n_targets,)}, default=1.0) : Constant that multiplies the L2 term, controlling regularization strength. alpha must be a non-negative float, i.e. in [0, inf). When alpha = 0, the objective is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Ridge object is not advised. Instead, you should use the LinearRegression object. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number.
fit_intercept (bool, default=True) : Whether to fit the intercept for this model. If set to False, no intercept will be used in calculations (i.e. X and y are expected to be centered).
solver ({‘auto’, ‘svd’, ‘cholesky’, ‘lsqr’, ‘sparse_cg’, ‘sag’, ‘saga’, ‘lbfgs’}, default=’auto’) : Solver to use in the computational routines:
‘auto’ chooses the solver automatically based on the type of data.
‘svd’ uses a Singular Value Decomposition of X to compute the Ridge coefficients. It is the most stable solver, in particular more stable for singular matrices than ‘cholesky’ at the cost of being slower.
‘cholesky’ uses the standard scipy.linalg.solve function to obtain a closed-form solution.
‘sparse_cg’ uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than ‘cholesky’ for large-scale data (possibility to set tol and max_iter).
‘lsqr’ uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest and uses an iterative procedure.
‘sag’ uses a Stochastic Average Gradient descent, and ‘saga’ uses its improved, unbiased version named SAGA. Both methods also use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
‘lbfgs’ uses L-BFGS-B algorithm implemented in scipy.optimize.minimize. It can be used only when positive is True.
All solvers except ‘svd’ support both dense and sparse data. However, only ‘lsqr’, ‘sag’, ‘sparse_cg’, and ‘lbfgs’ support sparse input when fit_intercept is True.
tol (float, default=1e-4) : The precision of the solution (coef_) is determined by tol, which specifies a different convergence criterion for each solver:
‘svd’: tol has no impact.
‘cholesky’: tol has no impact.
‘sparse_cg’: norm of residuals smaller than tol.
‘lsqr’: tol is set as atol and btol of scipy.sparse.linalg.lsqr, which control the norm of the residual vector in terms of the norms of matrix and coefficients.
‘sag’ and ‘saga’: relative change of coef smaller than tol.
‘lbfgs’: maximum of the absolute (projected) gradient = max|residuals| smaller than tol.
class sklearn.linear_model.Lasso(alpha=1.0, *, fit_intercept=True, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic'), Linear Model trained with L1 prior as regularizer (aka the Lasso). The optimization objective for Lasso is: (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1. [6]
There are some important parameters:
alpha (float, default=1.0) : Constant that multiplies the L1 term, controlling regularization strength. alpha must be a non-negative float, i.e. in [0, inf). When alpha = 0, the objective is equivalent to ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Instead, you should use the LinearRegression object.
fit_intercept (bool, default=True) : Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. data is expected to be centered).
max_iter (int, default=1000) : The maximum number of iterations.
tol (float, default=1e-4) : The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol (see the Notes section of the scikit-learn documentation).
warm_start (bool, default=False) : When set to True, reuse the solution of the previous call to fit as initialization; otherwise, just erase the previous solution. See the Glossary.
And some commonly used methods:
fit(X, y[, sample_weight, check_input]) : Fit model with coordinate descent.
path(X, y, *[, l1_ratio, eps, n_alphas, ...]) : Compute elastic net path with coordinate descent.
predict(X) : Predict using the linear model.
score(X, y[, sample_weight]) : Return the coefficient of determination of the prediction.
Here we briefly introduce the formulation of linear regression. [7]
Formulation
Given a data set $\{y_i, x_{i1}, \dots, x_{ip}\}_{i=1}^{n}$ of $n$ statistical units, a linear regression model assumes that the relationship between the dependent variable $y$ and the vector of regressors $x$ is linear. This relationship is modeled through a disturbance term or error variable $\varepsilon$, an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and regressors. Thus the model takes the form

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = \mathbf{x}_i^{\mathsf T}\boldsymbol\beta + \varepsilon_i, \qquad i = 1, \dots, n,$$

where $^{\mathsf T}$ denotes the transpose, so that $\mathbf{x}_i^{\mathsf T}\boldsymbol\beta$ is the inner product between the vectors $\mathbf{x}_i$ and $\boldsymbol\beta$. Often these $n$ equations are stacked together and written in matrix notation as

$$\mathbf{y} = X\boldsymbol\beta + \boldsymbol\varepsilon,$$

where

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} \mathbf{x}_1^{\mathsf T} \\ \mathbf{x}_2^{\mathsf T} \\ \vdots \\ \mathbf{x}_n^{\mathsf T} \end{bmatrix}
  = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}, \quad
\boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \quad
\boldsymbol\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
Estimation methods
A large number of procedures have been developed for parameter estimation and inference in linear regression. These methods differ in computational simplicity of algorithms, presence of a closed-form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions needed to validate desirable statistical properties such as consistency and asymptotic efficiency.
Some of the more common estimation techniques for linear regression are summarized below.
Least-squares estimation and related techniques
Assuming that the independent variable is $\vec{x}_i = [x_{1i}, x_{2i}, \dots, x_{mi}]$ and the model's parameters are $\vec{\beta} = [\beta_0, \beta_1, \dots, \beta_m]$, the model's prediction would be

$$y_i \approx \beta_0 + \sum_{j=1}^{m} \beta_j x_{ji}.$$

If $\vec{x}_i$ is extended to $\vec{x}_i = [1, x_{1i}, x_{2i}, \dots, x_{mi}]$, then $y_i$ becomes a dot product of the parameter vector and the independent variable, i.e.

$$y_i \approx \sum_{j=0}^{m} \beta_j x_{ji} = \vec{\beta} \cdot \vec{x}_i.$$

In the least-squares setting, the optimum parameter vector is defined as the one that minimizes the sum of squared losses:

$$\hat{\vec{\beta}} = \arg\min_{\vec{\beta}} L(D, \vec{\beta}) = \arg\min_{\vec{\beta}} \sum_{i=1}^{n} \left( \vec{\beta} \cdot \vec{x}_i - y_i \right)^2.$$

Now putting the independent and dependent variables in matrices $X$ and $Y$ respectively, the loss function can be rewritten as

$$L(D, \vec{\beta}) = \lVert X\vec{\beta} - Y \rVert^2 = (X\vec{\beta} - Y)^{\mathsf T}(X\vec{\beta} - Y) = Y^{\mathsf T}Y - Y^{\mathsf T}X\vec{\beta} - \vec{\beta}^{\mathsf T}X^{\mathsf T}Y + \vec{\beta}^{\mathsf T}X^{\mathsf T}X\vec{\beta}.$$

As the loss is convex, the optimum solution lies at gradient zero. The gradient of the loss function is (using the denominator layout convention)

$$\frac{\partial L(D, \vec{\beta})}{\partial \vec{\beta}} = \frac{\partial \left( Y^{\mathsf T}Y - Y^{\mathsf T}X\vec{\beta} - \vec{\beta}^{\mathsf T}X^{\mathsf T}Y + \vec{\beta}^{\mathsf T}X^{\mathsf T}X\vec{\beta} \right)}{\partial \vec{\beta}} = -2X^{\mathsf T}Y + 2X^{\mathsf T}X\vec{\beta}.$$

Setting the gradient to zero produces the optimum parameter:

$$-2X^{\mathsf T}Y + 2X^{\mathsf T}X\vec{\beta} = 0 \;\Rightarrow\; X^{\mathsf T}X\vec{\beta} = X^{\mathsf T}Y \;\Rightarrow\; \hat{\vec{\beta}} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}Y.$$

Note: to prove that the $\hat{\vec{\beta}}$ obtained is indeed a minimum, one needs to differentiate once more to obtain the Hessian matrix and show that it is positive definite. This is provided by the Gauss–Markov theorem.
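The closed-form solution above can be checked numerically. Below is a small sketch that builds synthetic data with an assumed "true" coefficient vector (my own choice, purely for illustration), solves the normal equations with NumPy, and compares the result against scikit-learn's LinearRegression:

```python
# Sketch: solve X^T X beta = X^T y directly and compare with LinearRegression.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                   # synthetic features
y = X @ np.array([1.5, -2.0, 0.7]) + 0.5 + rng.normal(scale=0.1, size=100)

X_design = np.column_stack([np.ones(len(X)), X])                # prepend intercept column
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

model = LinearRegression().fit(X, y)
print(beta_hat)                          # [intercept, coefficients] from the normal equations
print(model.intercept_, model.coef_)     # should match closely
```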
Here is the algorithm workflow:
Data Collection: The first step is to collect the data needed for the analysis. This includes both the dependent variable (the variable you want to predict) and one or more independent variables (the variables used to predict the dependent variable).
Data Preprocessing: Once the data is collected, it may need to be preprocessed to ensure its quality and suitability for analysis. This can include handling missing values, removing outliers, and scaling or normalizing the variables if necessary.
Splitting Data: After preprocessing, the data is typically split into two subsets: a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate its performance. This helps assess how well the model generalizes to new, unseen data.
Model Building: With the training data, the linear regression model is built. This involves estimating the coefficients (intercept and slopes) that best fit the relationship between the independent variables and the dependent variable. The model aims to minimize the difference between the observed and predicted values of the dependent variable.
Model Evaluation: Once the model is trained, it's evaluated using the testing data. Common evaluation metrics for linear regression include mean squared error (MSE), root mean squared error (RMSE), and R-squared (coefficient of determination). These metrics help assess how well the model fits the data and how accurately it predicts the dependent variable.
Model Interpretation: After evaluating the model's performance, it's important to interpret the results. This includes understanding the significance and interpretation of the model coefficients, as well as assessing the overall fit of the model to the data.
Prediction: Finally, the trained model can be used to make predictions on new, unseen data. Given the values of the independent variables, the model can predict the value of the dependent variable.
Model Deployment: If the model performs well and meets the desired criteria, it can be deployed for use in real-world applications. This may involve integrating it into software systems or using it to inform decision-making processes.
Overall, the linear regression algorithm works by fitting a straight line to the data, allowing us to understand and predict the relationship between the independent and dependent variables.
13.1 Fitting a Line
This section will train a model that represents a linear relationship between the feature and target vector.
In 1 a linear regression model is created and fitted using Scikit-learn.
Firstly, the LinearRegression class is imported from the sklearn.linear_model module, and the make_regression function is imported from the sklearn.datasets module.
Then, the make_regression function is used to generate a features matrix (features) and a target vector (target). This function generates synthetic data with a specific number of samples, features, informative features, targets, and noise.
Next, a LinearRegression object is created, which will be used to train the linear regression model.
Finally, the fit method is called to train the model using the features matrix and target vector.
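The exact code of In 1 is not reproduced here; the following is a sketch of the cell as described above, with illustrative make_regression parameters (the specific values are my assumptions):

```python
# Sketch: generate synthetic data and fit an ordinary least squares model.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate a features matrix and target vector (parameter values are illustrative)
features, target = make_regression(n_samples=100,
                                   n_features=3,
                                   n_informative=2,
                                   n_targets=1,
                                   noise=0.2,
                                   random_state=1)

# Create the linear regression object and train it on the data
regression = LinearRegression()
model = regression.fit(features, target)
```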
The figure below shows the features we created.
We have trained our model using only three features. This means our linear model will be:

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_3 + \epsilon$$

where $\hat{y}$ is our target, $x_i$ is the data for a single feature, $\hat\beta_1$, $\hat\beta_2$, and $\hat\beta_3$ are the coefficients identified by fitting the model, and $\epsilon$ is the error. After we have fit our model, we can view the value of each parameter. For example, $\hat\beta_0$, also called the bias or intercept, can be viewed using intercept_ as in In 8; the intercept is -0.096.
And the remaining coefficients are shown using coef_ as in In 9.
In 10 we can see that, in our dataset, the target value is a randomly generated continuous variable.
In 11, using the predict method, we can predict the output based on the input features.
Not bad! Our model was off only by about 0.01!
The major advantage of linear regression is its interpretability, in large part because the coefficients of the model quantify the effect of a one-unit change in a feature on the target vector. Our model's coefficient for the first feature was ~–0.02, meaning the target changes by about –0.02 for each additional unit change in the first feature.
In 12, using the score method, we can also see how well our model performed on the data.
The default score for linear regression in scikit-learn is R², which ranges from 0.0 (worst) to 1.0 (best). As we can see in this example, we are very close to the perfect value of 1.0. However, it's worth noting that we are evaluating this model on data it has already seen (the training data), whereas typically we'd evaluate on a held-out test set instead. Nonetheless, such a high score would bode well for our model in a real setting.
13.2 Handling Interactive Effects
This section covers the case where a feature's effect on the target variable depends on another feature.
In 13, we create a linear regression model with interaction features using Scikit-learn and fit it on a synthetic dataset.
First, we generate a synthetic dataset containing two features and a target vector using the make_regression function.
Then, we use PolynomialFeatures to create interaction terms. This allows us to combine the original features to generate more features, thus increasing the model's ability to capture nonlinear relationships. In this example, we set interaction_only=True to generate only interaction terms between features without including higher-order terms of each feature itself.
Finally, we create a linear regression model (LinearRegression) and fit the generated feature matrix and target vector to the model using the fit method.
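A sketch of the cell described above (the make_regression and PolynomialFeatures parameter values are illustrative assumptions):

```python
# Sketch: add interaction terms with PolynomialFeatures, then fit a linear model.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.preprocessing import PolynomialFeatures

features, target = make_regression(n_samples=100,
                                   n_features=2,
                                   n_informative=2,
                                   n_targets=1,
                                   noise=0.2,
                                   random_state=1)

# Keep only interaction terms (no squared/cubed terms and no bias column)
interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)
features_interaction = interaction.fit_transform(features)

model = LinearRegression().fit(features_interaction, target)
```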
Sometimes a feature’s effect on our target variable is at least partially dependent on another feature. For example, imagine a simple coffee-based example where we have two binary features—the presence of sugar (sugar) and whether or not we have stirred (stirred)—and we want to predict if the coffee tastes sweet. Just putting sugar in the coffee (sugar=1, stirred=0) won’t make the coffee taste sweet (all the sugar is at the bottom!) and just stirring the coffee without adding sugar (sugar=0, stirred=1) won’t make it sweet either. Instead it is the interaction of putting sugar in the coffee and stirring the coffee (sugar=1, stirred=1) that will make a coffee taste sweet. The effects of sugar and stirred on sweetness are dependent on each other. In this case we say there is an interaction effect between the features sugar and stirred.
We can account for interaction effects by including a new feature comprising the product of corresponding values from the interacting features:

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_1 x_2 + \epsilon$$

where $x_1$ and $x_2$ are the values of sugar and stirred, respectively, and $x_1 x_2$ represents the interaction between the two.
In 17, our solution uses a dataset containing only two features; here are the first observation's values for each of those features.
In 18 To create an interaction term, we simply multiply those two values together for every observation.
In 19 We can then view the interaction term for the first observation.
However, while often we will have a substantive reason for believing there is an interaction between two features, sometimes we will not. In those cases it can be useful to use scikit-learn’s PolynomialFeatures to create interaction terms for all combinations of features. We can then use model selection strategies to identify the combination of features and interaction terms that produces the best model.
To create interaction terms using PolynomialFeatures, there are three important parameters we must set. Most important, interaction_only=True tells PolynomialFeatures to return only interaction terms (and not polynomial features, which we will discuss in Recipe 13.3). By default, PolynomialFeatures will add a feature containing 1s called a bias. We can prevent that with include_bias=False. Finally, the degree parameter determines the maximum number of features to create interaction terms from (in case we wanted to create an interaction term that is the combination of three features). In 20 We can see the output of PolynomialFeatures from our solution by checking to see if the first observation’s feature values and interaction term value match our manually calculated version.
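As a quick, self-contained check (with illustrative data of my own choosing), the following sketch confirms that the interaction column produced by PolynomialFeatures matches the manual product of the two features for the first observation:

```python
# Check: PolynomialFeatures(interaction_only=True) reproduces the manual x1 * x2.
from sklearn.datasets import make_regression
from sklearn.preprocessing import PolynomialFeatures

features, _ = make_regression(n_samples=100, n_features=2,
                              n_informative=2, noise=0.2, random_state=1)

interaction = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
features_interaction = interaction.fit_transform(features)

print(features[0, 0] * features[0, 1])   # manually computed interaction term
print(features_interaction[0])           # [x1, x2, x1*x2] for the first observation
```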
13.3 Fitting a Nonlinear Relationship
This section will model a nonlinear relationship.
In 21, we create a polynomial regression model using Scikit-learn and fit it on a synthetic dataset.
First, we generate a synthetic dataset containing three features and a target vector using the make_regression function.
Then, we use PolynomialFeatures to create polynomial features. This allows us to combine the original features to generate more features with higher degrees, thus increasing the model's ability to capture nonlinear relationships. In this example, we set degree=3 to generate all possible combinations of cubic, quadratic, and linear terms of the original features.
Finally, we create a linear regression model (LinearRegression) and fit the generated polynomial feature matrix and target vector to the model using the fit method.
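A sketch of the cell described above (the parameter values are illustrative assumptions):

```python
# Sketch: expand the features to degree-3 polynomial terms, then fit a linear model.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.preprocessing import PolynomialFeatures

features, target = make_regression(n_samples=100,
                                   n_features=3,
                                   n_informative=2,
                                   noise=0.2,
                                   random_state=1)

# Create x, x^2, and x^3 terms (plus cross terms) for every feature, no bias column
polynomial = PolynomialFeatures(degree=3, include_bias=False)
features_polynomial = polynomial.fit_transform(features)

model = LinearRegression().fit(features_polynomial, target)
```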
So far we have discussed modeling only linear relationships. An example of a linear relationship would be the number of stories a building has and the building’s height. In linear regression, we assume the effect of number of stories and building height is approximately constant, meaning a 20-story building will be roughly twice as high as a 10-story building, which will be roughly twice as high as a 5-story building. Many relationships of interest, however, are not strictly linear.
Often we want to model a nonlinear relationship—for example, the relationship between the number of hours a student studies and the score she gets on a test. Intuitively, we can imagine there is a big difference in test scores between students who study for one hour compared to students who did not study at all. However, there is a much smaller difference in test scores between a student who studied for 99 hours and a student who studied for 100 hours. The effect that one hour of studying has on a student’s test score decreases as the number of hours increases.
Polynomial regression is an extension of linear regression that allows us to model nonlinear relationships. To create a polynomial regression, convert the linear function we used in Recipe 13.1:

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \epsilon$$

into a polynomial function by adding polynomial features:

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_1^2 + \cdots + \hat\beta_d x_1^d + \epsilon$$

where $d$ is the degree of the polynomial. How are we able to use a linear regression for a nonlinear function? The answer is that we do not change how the linear regression fits the model but rather only add polynomial features. That is, the linear regression does not “know” that $x^2$ is a quadratic transformation of $x$. It just considers it one more variable.
A more practical description might be in order. To model nonlinear relationships, we can create new features that raise an existing feature, x, up to some power: x^2, x^3, and so on. The more of these new features we add, the more flexible the “line” created by our model. To make this more explicit, imagine we want to create a polynomial to the third degree. In 27 For the sake of simplicity, we will focus on only one observation (the first observation in the dataset), x[0] = -0.61175641.
In 28 To create a polynomial feature, we would raise the first observation’s value to the second degree, x1^2 = 0.37424591.
In 29 This would be our new feature. We would then also raise the first observation’s value to the third degree, x1^3 = -0.22864734
In 30 By including all three features (x, x^2, and x^3) in our feature matrix and then running a linear regression, we have conducted a polynomial regression.
PolynomialFeatures has two important parameters. First, degree determines the maximum number of degrees for the polynomial features. For example, degree=3 will generate x^2 and x^3. Second, by default PolynomialFeatures includes a feature containing only 1s (called a bias). We can remove that by setting include_bias=False.
13.4 Reducing Variance with Regularization
This section will reduce the variance of your linear regression model.
We can use a learning algorithm that includes a shrinkage penalty (also called regularization) like ridge regression and lasso regression.
In 31, we create a Ridge regression model using Scikit-learn and fit it on a synthetic dataset.
First, we generate a synthetic dataset containing three features and a target vector using the make_regression function.
Then, we standardize the features using StandardScaler. Standardization ensures that the values of features have a mean of 0 and a standard deviation of 1, which helps improve the convergence speed and predictive performance of the model.
Next, we create a Ridge regression model (Ridge) and set an alpha parameter value of 0.5. Alpha is the regularization parameter of Ridge regression, used to control the complexity of the model and prevent overfitting.
Finally, we use the fit method to fit the standardized feature matrix and target vector to the Ridge regression model.
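A sketch of the cell described above, using alpha=0.5 as mentioned in the text (the make_regression parameters are illustrative assumptions):

```python
# Sketch: standardize the features, then fit ridge regression with alpha=0.5.
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

features, target = make_regression(n_samples=100,
                                   n_features=3,
                                   n_informative=2,
                                   noise=0.2,
                                   random_state=1)

# Standardize so the shrinkage penalty treats all coefficients on the same scale
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

regression = Ridge(alpha=0.5)
model = regression.fit(features_standardized, target)
```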
In standard linear regression the model trains to minimize the sum of squared error between the true ($y_i$) and predicted ($\hat{y}_i$) target values, i.e., the residual sum of squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Regularized regression learners are similar, except they attempt to minimize RSS plus some penalty for the total size of the coefficient values, called a shrinkage penalty because it attempts to “shrink” the model. There are two common types of regularized learners for linear regression: ridge regression and the lasso. The only formal difference is the type of shrinkage penalty used. In ridge regression, the shrinkage penalty is a tuning hyperparameter multiplied by the squared sum of all coefficients:

$$\mathrm{RSS} + \alpha \sum_{j=1}^{p} \hat\beta_j^2$$

where $\hat\beta_j$ is the coefficient of the $j$th of $p$ features and $\alpha$ is a hyperparameter (discussed next). The lasso is similar, except the shrinkage penalty is a tuning hyperparameter multiplied by the sum of the absolute values of all coefficients:

$$\frac{1}{2n}\,\mathrm{RSS} + \alpha \sum_{j=1}^{p} \left| \hat\beta_j \right|$$

where $n$ is the number of observations. So which one should we use? As a very general rule of thumb, ridge regression often produces slightly better predictions than lasso, but lasso (for reasons we will discuss in Recipe 13.5) produces more interpretable models. If we want a balance between ridge and lasso’s penalty functions we can use elastic net, which is simply a regression model with both penalties included. Regardless of which one we use, both ridge and lasso regressions can penalize large or complex models by including coefficient values in the loss function we are trying to minimize.
The hyperparameter, α, lets us control how much we penalize the coefficients, with higher values of α creating simpler models. The ideal value of α should be tuned like any other hyperparameter. In scikit-learn, α is set using the alpha parameter.
In 32, we use scikit-learn's RidgeCV class, which allows us to select the ideal value for α.
In 33 We can then easily view the best model’s α value.
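A sketch of the RidgeCV selection described above (the candidate alpha grid and data parameters are illustrative assumptions):

```python
# Sketch: select alpha by built-in cross-validation with RidgeCV.
from sklearn.linear_model import RidgeCV
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

features, target = make_regression(n_samples=100, n_features=3,
                                   n_informative=2, noise=0.2, random_state=1)
features_standardized = StandardScaler().fit_transform(features)

regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])     # candidate alphas (illustrative)
model_cv = regr_cv.fit(features_standardized, target)

print(model_cv.coef_)    # coefficients of the best model
print(model_cv.alpha_)   # the selected alpha value
```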
One final note: because in linear regression the value of the coefficients is partially determined by the scale of the features, and in regularized models all coefficients are summed together, we must make sure to standardize the features prior to training.
13.5 Reducing Features with Lasso Regression
This section will simplify your linear regression model by reducing the number of features.
In 34, we create a Lasso regression model using Scikit-learn and fit it on a synthetic dataset.
First, we generate a synthetic dataset containing three features and a target vector using the make_regression function.
Then, we standardize the features using StandardScaler. Standardization ensures that the values of features have a mean of 0 and a standard deviation of 1, which helps improve the convergence speed and predictive performance of the model.
Next, we create a Lasso regression model (Lasso) and set an alpha parameter value of 0.5. Alpha is the regularization parameter of Lasso regression, used to control the complexity of the model and prevent overfitting.
Finally, we use the fit method to fit the standardized feature matrix and target vector to the Lasso regression model.
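A sketch of the cell described above, using alpha=0.5 as mentioned in the text (the other parameter values are illustrative assumptions):

```python
# Sketch: standardize the features, then fit lasso regression with alpha=0.5.
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

features, target = make_regression(n_samples=100,
                                   n_features=3,
                                   n_informative=2,
                                   noise=0.2,
                                   random_state=1)

features_standardized = StandardScaler().fit_transform(features)

regression = Lasso(alpha=0.5)
model = regression.fit(features_standardized, target)
print(model.coef_)   # with a large enough alpha, some coefficients shrink to exactly 0
```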
One interesting characteristic of lasso regression’s penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model. In 35 For example, in our solution we set alpha to 0.5, and we can see that many of the coefficients are 0, meaning their corresponding features are not used in the model.
In 36 However, if we increase α to a much higher value, we see that literally none of the features are being used.
The practical benefit of this effect is that it means we could include 100 features in our feature matrix and then, through adjusting lasso’s α hyperparameter, produce a model that uses only 10 (for instance) of the most important features. This lets us reduce variance while improving the interpretability of our model (since fewer features are easier to explain).
First, we introduce the California Housing dataset.
Data Set Characteristics:
Number of Instances: 20640
Number of Attributes: 8 numeric, predictive attributes and the target
Attribute Information:
MedInc median income in block group
HouseAge median house age in block group
AveRooms average number of rooms per household
AveBedrms average number of bedrooms per household
Population block group population
AveOccup average number of household members
Latitude block group latitude
Longitude block group longitude
Missing Attribute Values: None
This dataset was obtained from the StatLib repository. https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the sklearn.datasets.fetch_california_housing function.
When loading the dataset I got an error (HTTPError: HTTP Error 403: Forbidden); the solution is described in [8].
In 8, we first look at the first 5 rows of the data.
Then we visualize the data, as sketched below.
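A sketch of loading the dataset as a DataFrame, inspecting the first five rows, and drawing the per-feature histograms discussed next (the bin count and figure size are illustrative choices):

```python
# Sketch: load the California housing data, show the head, and plot histograms.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame                       # 8 features plus the MedHouseVal target

print(df.head())                         # first 5 rows
df.hist(bins=50, figsize=(12, 8))        # histogram of each column
plt.show()
```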
From the above histograms of the different features, we can conclude that:
1. Features are distributed on very different scales
2. In the HouseAge and house value columns, the values are capped at 50 and 5, respectively.
For better accuracy, we should preprocess those features. We can either perform feature engineering or clean those problematic instances.
Now we plot the housing value with respect to longitude and latitude, i.e., based on location.
The above plot displays the map of California, with the color map corresponding to house value and the radius of the circles corresponding to the population of the areas. Based on this plot, we can conclude that:
1. Houses near the ocean are valued higher.
2. Houses in high population density areas are also valued higher, but the effect decreases as we move farther from the ocean.
3. There are some outliers.
Next, we will plot the correlation between the features against each other.
If we check the correlation against target, we can see that all the other features show somewhat weak correlation, except for MedInc (Median Income). Let’s explore further.
The table below displays the numerical correlation of each feature against the target.
As expected, MedInc (median income) shows a strong correlation. [9]
Next we are going to train the model.
In 39, we implement a linear regression analysis on the California housing dataset.
Firstly, necessary libraries are imported, including NumPy, Pandas, Matplotlib, as well as modules from Scikit-Learn. Then, the fetch_california_housing function from Scikit-Learn is used to load the California housing dataset. This dataset contains housing prices in various regions of California along with various features related to housing prices.
The data is converted into a Pandas DataFrame, and the features (X) and target (y) are stored separately. Subsequently, the dataset is split into training and testing sets using the train_test_split function, with the testing set constituting 20% of the total data.
Next, a linear regression model is created and fitted to the training data. After fitting, the trained model is used to predict the target values for the testing data. The Mean Squared Error (MSE) is calculated to evaluate the performance of the model on the testing set.
Finally, the R^2 score is printed out. The R^2 score is a metric used to evaluate the predictive performance of the model, with a value closer to 1 indicating better predictive performance. r2 score : 0.575787706032451
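A sketch approximating the cell described above (the 20% test split matches the text; random_state and other details are illustrative assumptions, so exact scores may differ slightly):

```python
# Sketch: linear regression on the California housing data with a train/test split.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

housing = fetch_california_housing(as_frame=True)
X = housing.data                       # 8 numeric features
y = housing.target                     # median house value in $100,000s

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("r2 score:", r2_score(y_test, y_pred))
```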
In 47 we change the model to ridge, and try to find the best alpha.
In 71 we change the model to Lasso, and try to find the best alpha.
The table below shows the r2_score results we tried. For Ridge, choosing alpha = 500.0 gives an r2_score of 0.5847, while for Lasso, choosing alpha = 0.01 gives an r2_score of 0.5845, which is very close. A sketch of this alpha search follows.
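A sketch of how such an alpha search might be run (the candidate alpha grids and random_state are illustrative assumptions; they include the values 500.0 and 0.01 mentioned above):

```python
# Sketch: compare Ridge and Lasso over a grid of alphas on the held-out test set.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

for alpha in [0.1, 1.0, 10.0, 100.0, 500.0]:
    r2 = r2_score(y_test, Ridge(alpha=alpha).fit(X_train, y_train).predict(X_test))
    print(f"Ridge alpha={alpha}: r2 = {r2:.4f}")

for alpha in [0.001, 0.01, 0.1, 1.0]:
    r2 = r2_score(y_test, Lasso(alpha=alpha).fit(X_train, y_train).predict(X_test))
    print(f"Lasso alpha={alpha}: r2 = {r2:.4f}")
```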
In 72 we implement a RandomForestRegressor, which is covered in Chapter 14.5.
From the table below we can see that the random forest algorithm performs better than linear regression; that is because real-world datasets do not always exhibit linear relationships!
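A sketch of the random forest comparison (hyperparameters are left at scikit-learn defaults; random_state is an illustrative assumption, so the score may differ slightly from the table):

```python
# Sketch: random forest regression on the same train/test split for comparison.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

housing = fetch_california_housing(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

forest = RandomForestRegressor(random_state=42).fit(X_train, y_train)
print("r2 score:", r2_score(y_test, forest.predict(X_test)))
```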
[1] Chapter 13. Linear Regression, Machine Learning with Python - Theory and Implementation.
[2] What is linear regression?, IBM.
[3] sklearn.linear_model: Linear Models, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[4] sklearn.linear_model.LinearRegression, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[5] sklearn.linear_model.Ridge, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[6] sklearn.linear_model.Lasso, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[7] Linear regression, Wikipedia.
[8] Fixing the error when importing the fetch_california_housing California housing dataset (HTTPError: HTTP Error 403: Forbidden), CSDN, 桂花很香,旭很美, 2024-03-14.
[9] Implementing Linear Regression on California Housing Dataset, Debarshi Raj Basumatary, Jul 18, 2023.